Bug 491549
Summary: | Netfilter NAT mishandles udp checksum 0 => 0xffff transition - also disabled HW checksum | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | David Bein <d.bein> |
Component: | kernel | Assignee: | Jiri Olsa <jolsa> |
Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | high | Docs Contact: | |
Priority: | low | ||
Version: | 5.3 | CC: | anton, davem, d.bein, dzickus, jpirko, nhorman, qcai, tgraf, tis |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2010-02-22 08:03:44 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Description
David Bein
2009-03-22 22:30:49 UTC
Created attachment 336569 [details] Mongo checksum/netfilter bits from 2.6.19.7 I am attaching patches relative to 2.6.18-128.1.1 which implement 2 patches from 2.6.19.7: (1) [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.19 .y.git;a=commitdiff_plain;h=84fa7933a33f806bbbaae6775e87459b1ec584c0;hp=8584d6df 39db5601965f9bc5e3bf2fea833ad7bb (2) [NETFILTER]: Get rid of HW checksum invalidation http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.19 .y.git;a=commitdiff_plain;h=4cf411de49c65140b3c259748629b561c0d3340f NOTES: Some of these patches are within CONFIG_XEN bits, but I have NOTES: no ready way to test the XEN code, so someone who is knowledgeable NOTES: about the XEN frontend/backend logic would need to look at these NOTES: and maybe make changes. They do compile properly. Other than NOTES: the CONFIG_XEN specific code segments, these changes have been NOTES: run and show that (a) having NAT present even w/o any rules does NOTES: not disable outbound HW checksum (setup by tcp/udp as CHECKSUM_PARTIAL). NOTES: And it fixes the problem we were seeing with missing udp 0 => 0xffff NOTES: on-the-wire transitions. Also changes to drivers not used on i686 NOTES: or x86_64 configs have not been compiled, There are only a few NOTES: of these. Because the existing code has the side-effect is disabling HW assisted outbound checksum, I am marking this as a high priority issue because that is for most people a more serious issue than the udp checksum 0 => 0xffff not happening properly. If someone wants the full modified source files, please let me know. The context diffs work as I have tested them, but they don't necessarily show enough context especially bits in CONFIG_XEN ... #endif segments. One bit of the patches sometimes can produce a bad tcp checksum in tcp_retrans_try_collapse() [ in net/ipv4/tcp_output.c ]. There is a patch reverting this one bit in 2.6.20: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=52d570aabe921663a987b2e4bae2bdc411cee480 I will add an additional patch which has this adjusted relative to the mongo patch I've already attached. Created attachment 336573 [details] Addition patch from 2.6.20 which reverts 1 bit from 2.6.19 patches This is from From: Jarek Poplawski <jarkao2> Date: Wed, 24 Jan 2007 06:07:12 +0000 (-0800) Subject: [TCP]: rare bad TCP checksum with 2.6.19 X-Git-Tag: v2.6.20-rc6~5^2~3 http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=52d570aabe921663a987b2e4bae2bdc411cee480 I understand the first issue with 'wrong' zero checksum and I'll try to reproduce it. You mentioned nfs client, could you plz be more specific about your test. I'll try smth on my own meanwhile.
I dont follow the second part about 'disabled HW checksum'. I guess I'm missing something but could you plz be more specific what's currently wrong and why is it a show stopper.
I'm pointing this message from you:
> Because the existing code has the side-effect is disabling HW assisted
> outbound checksum, I am marking this as a high priority issue because
> that is for most people a more serious issue than the udp checksum 0 => 0xffff
> not happening properly.
All the patches you attached deal with the HW checksum. I found current RHEL stream has:
commit: 84fa7933a33f806bbbaae6775e87459b1ec584c0
[NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE
I dont see the rest you post. I guess all those patches are related
to the 'disabled HW checksum' problem..
If it isn't clear from the patches, what is happening is that some outbound udp packet is setup in udp.c with CHECKSUM_HW [with the patches => CHECKSUM_PARTIAL]. It is not at that point completely checksummed and is (absent NAT) going to be handed to the outgoing nic (e.g. e1000) which will complete the checksum and if the final checksum would compute to 0, the nic would transition that to -1 (0xffff). From the tests we ran, roughly 25 to 50 packets out of 17 million (using the exact udp datagrams being sent over roughly 1 hour) will naturally checksum to 0 and if it were being done completely in software (e.g. tx checksum is disabled using ethtool -K eth0 tx off), the 0 => 0xffff conversion would happen at the udp layer itself (would be CHECKSUM_NONE as the NAT layer would see it). In this case there is no bug provided there is no matching NAT rule - in our case this would have been true. However since we are using tx checksum offloading (this is the normal case for NICs which have this feature), the partial checksum is being stepped on in ip_nat_fn() [net/ipv4/netfilter/ip_standalone_nat.c] in the following (unpatched) code segment: ct = ip_conntrack_get(*pskb, &ctinfo); /* Don't try to NAT if this packet is not conntracked */ if (ct == &ip_conntrack_untracked) return NF_ACCEPT; /* If we had a hardware checksum before, it's now invalid */ if ((*pskb)->ip_summed == CHECKSUM_HW) if (skb_checksum_help(*pskb, (out == NULL))) return NF_DROP; Even if (especially if) there are no matching NAT rules (there were not in our test environment), this fails to check for the 0 to 0xffff transition on the end of it. After skb_checksum_help() (out != NULL in this case), the checksum could be 0 and it is there that the bug of udp checksum == 0 is happening. But - the bigger problem is that it is touching the checksum which is partially setup for an outbound nic which should do the majority of the checksum calculation including the 0 -> -1 transition. Simply fixing skb_checksum_help() would be one option, but the fact that it runs at this point is disabling HW checksum offload, with a corresponding performance hit. The supplied patches (all 3 of them) applied in order do not touch the packet until/unless there is a matching NAT rule and have special handling for the partially computed checksum (after the patch ip_summed would be CHECKSUM_PARTIAL). In a nutshell, you could fix just the udp checksum == 0 problem, but the bigger problem is that without the upstream patches, no outbound packet from tcp or udp will have the checksumming completed by capable hardware. This is the real issue. Last but not least - the existing netfilter/nat code w/o the supplied patches does not do anything illegal, but it is quite sub-optimal. sending a udp packet over the wire with checksum == 0 is legal, but it should only happen if some agent (nfs or user program) has overtly disabled the checksum calculation for a given udp socket. Of the patches submitted, there is one which I did based on upstream (2.6.28) sources. The submitted patch is: --- ./drivers/net/cxgb3/sge.c.rhel5.3 2009-03-06 19:25:02.000000000 -0500 +++ ./drivers/net/cxgb3/sge.c 2009-03-24 10:32:32.000000000 -0400 @@ -1092,7 +1092,7 @@ static void write_tx_pkt_wr(struct adapt } else { cntrl |= V_TXPKT_OPCODE(CPL_TX_PKT); cntrl |= F_TXPKT_IPCSUM_DIS; /* SW calculates IP csum */ - cntrl |= V_TXPKT_L4CSUM_DIS(skb->ip_summed != CHECKSUM_HW); + cntrl |= V_TXPKT_L4CSUM_DIS(skb->ip_summed != CHECKSUM_PARTIAL)\ ; cpl->cntrl = htonl(cntrl); if (skb->len <= WR_LEN - sizeof(*cpl)) { @@ -1195,7 +1195,7 @@ int t3_eth_xmit(struct sk_buff *skb, str } /* update port statistics */ - if (skb->ip_summed == CHECKSUM_HW) + if (skb->ip_summed == CHECKSUM_COMPLETE) qs->port_stats[SGE_PSTAT_TX_CSUM]++; if (skb_shinfo(skb)->gso_size) qs->port_stats[SGE_PSTAT_TSO]++; The change from CHECKSUM_HW => CHECKSUM_COMPLETE should be CHECKSUM_HW => CHECKSUM_PARTIAL since this is outbound and is counting checksum offload tx packets. No outbound packet with the 4 possible values for ip_summed should be CHECKSUM_COMPLETE. That only has meaning in input from the driver contexts. This is incorrect IMO upstream. Maybe it has been fixed. Also - I have a few more bits in net/ipv4/netfilter which are from later patches in 2.6.19-7 which fix a few more checksum bits among other things. I will attach that patch next. Note that the patches I have included so far should not be considered as making netfilter identical to 2.6.19-7 (as that drags on a lot of other things). Created attachment 338040 [details]
A few more important bits in net/ipv4/netfilter
This fixes a few more obvious things - based on changes made by
Patrick McHardy after the initial 2 patches he made to 2.6.19.
I backprop those patches you identified to help you to the RHEL: [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE 84fa7933a33f806bbbaae6775e87459b1ec584c0 [NETFILTER]: Get rid of HW checksum invalidation 4cf411de49c65140b3c259748629b561c0d3340f [TCP]: rare bad TCP checksum with 2.6.19 52d570aabe921663a987b2e4bae2bdc411cee480 I built and published kernel rpm and the patches on http://people.redhat.com/jolsa/bz491549. (let me know if you need other arch than published). Could you plz try it and see if it works for you.. As it is a quite extensive change, especially the first patch (and the 2nd one highly depends on it), I'm thinking about some other way to fix it, so far, following hunk from the second patch seems to me like probable fix: diff --git a/net/ipv4/netfilter/ip_nat_standalone.c b/net/ipv4/netfilter/ip_nat_standalone.c index 8686b49..163db98 100644 --- a/net/ipv4/netfilter/ip_nat_standalone.c +++ b/net/ipv4/netfilter/ip_nat_standalone.c @@ -116,11 +116,6 @@ ip_nat_fn(unsigned int hooknum, if (ct == &ip_conntrack_untracked) return NF_ACCEPT; - /* If we had a hardware checksum before, it's now invalid */ - if ((*pskb)->ip_summed == CHECKSUM_HW) - if (skb_checksum_help(*pskb, (out == NULL))) - return NF_DROP; - /* Can't track? It's not due to stress, or conntrack would have dropped it. Hence it's the user's responsibilty to packet filter it out, or implement conntrack/NAT for that thoughts? Would you plz share with me more details about your testing, like what application you use and where do you catch the bug (printk?), so I'd be able to reproduce... You also mentioned you can share the source code with changes, plz do. thanks jirka (In reply to comment #10) > I backprop those patches you identified to help you to the RHEL: > > [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE > 84fa7933a33f806bbbaae6775e87459b1ec584c0 > > [NETFILTER]: Get rid of HW checksum invalidation > 4cf411de49c65140b3c259748629b561c0d3340f > > [TCP]: rare bad TCP checksum with 2.6.19 > 52d570aabe921663a987b2e4bae2bdc411cee480 > > > I built and published kernel rpm and the patches on > http://people.redhat.com/jolsa/bz491549. (let me know if you need other arch > than published). > > Could you plz try it and see if it works for you.. Thanks for the builds - I could test that the kernels work, but the way I diagnosed this was by adding per-cpu counters and a /proc interface to collect them at the udp layer where the outbound packets are being checksummed (partially or completely depending on (a) hardware CSO support and (b) single vs multi fragment packets. I then added the same as the layer where packets are handed to the outbound nic xmit routine for processing. It was only when I could see that with NAT present (and no rules) that the lower level was seeing none of the partial checksums and only the fully checksummed packets [only for udp] that I became suspicious and removed NAT from the mix. If you want me to test what you have, I will need the full patchset for the 138 builds you did so I can add the counters and do the testing I did to discover this and point out where it was happening. With these patches, it can be shown via counters that any higher level packet that tcp or udp generate partial checksums on are delivered to the hardware without any netfilter/nat routine disabling it. > > As it is a quite extensive change, especially the first patch (and the 2nd one > highly depends on it), I'm thinking about some other way to fix it, so far, > following hunk from the second patch seems to me like probable fix: > > diff --git a/net/ipv4/netfilter/ip_nat_standalone.c > b/net/ipv4/netfilter/ip_nat_standalone.c > index 8686b49..163db98 100644 > --- a/net/ipv4/netfilter/ip_nat_standalone.c > +++ b/net/ipv4/netfilter/ip_nat_standalone.c > @@ -116,11 +116,6 @@ ip_nat_fn(unsigned int hooknum, > if (ct == &ip_conntrack_untracked) > return NF_ACCEPT; > > - /* If we had a hardware checksum before, it's now invalid */ > - if ((*pskb)->ip_summed == CHECKSUM_HW) > - if (skb_checksum_help(*pskb, (out == NULL))) > - return NF_DROP; > - > /* Can't track? It's not due to stress, or conntrack would > have dropped it. Hence it's the user's responsibilty to > packet filter it out, or implement conntrack/NAT for that > > > thoughts? Doing this would fix it if NAT does not touch the packet, but if NAT touches the packet in any way, it has to do different things in the handling of the differential checksum depending on whether or not the packet is partially or fully checksummed. That is the nature of this fix. The first patchset is seemingly large, but conceptually it is mostly a transparent change to properly mark the ip_summed field distinctly in ways that do not confuse a packet (fragment) that was received from a nic that computed the checksum -> skb->csum and outbound partially checksummed packets. You want this because of things like bridges and I think forwarding as well. So the first patch is really a good patch because 3 states can not represent what 4 states can. From the way the 2nd patch is done, the dependency is on the distinctness of the 4 state patch because the now distinct CHECKSUM_COMPLETE value means only one thing. It would be possible to not change the outbound handling of CHECKSUM_HW to CHECKSUM_PARTIAL only changing inbound handling of it to CHECKSUM_COMPLETE, but that only means anything if there are 4 distinct values. I think for KABI reasons, you folks may not be able to use this approach to fixing this problem in any RHEL5.x release unless I am mistaken about your KABI requirements. Once RHEL6 hits the streets (whenever that is), the basis for it will be a lot newer than 2.6.18.4 and this problem will be fixed in the base. I don't expect you have KABI requirements across major releases (e.g. from RHEL5 to RHEL6). > > Would you plz share with me more details about your testing, like what > application you use and where do you catch the bug (printk?), > so I'd be able to reproduce... > > You also mentioned you can share the source code with changes, plz do. I'll dig up the counter code (relative to 128.1.1) but I think you folks will need to first decide if the basic changes to skbuff.h make it possible to fix any of this using this approach in RHEL5.x. Absent the 3 to 4 state changes there, none of the submitted patches mean anything. I think you could cut the cases down by ending up with CHECKSUM_HW being equivalent to CHECKSUM_PARTIAL (as it is now), but you still have to hunt for changes that are about inbound processing related to CHECKSUM_COMPLETE and it must be a distinct value [it isn't in the current releases]. > thanks > jirka (In reply to comment #11) > (In reply to comment #10) > > I backprop those patches you identified to help you to the RHEL: > > > > [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE > > 84fa7933a33f806bbbaae6775e87459b1ec584c0 > > > > [NETFILTER]: Get rid of HW checksum invalidation > > 4cf411de49c65140b3c259748629b561c0d3340f > > > > [TCP]: rare bad TCP checksum with 2.6.19 > > 52d570aabe921663a987b2e4bae2bdc411cee480 > > > > > > I built and published kernel rpm and the patches on > > http://people.redhat.com/jolsa/bz491549. (let me know if you need other arch > > than published). > > > > Could you plz try it and see if it works for you.. > > Thanks for the builds - I could test that the kernels work, but > the way I diagnosed this was by adding per-cpu counters and > a /proc interface to collect them at the udp layer where the > outbound packets are being checksummed (partially or completely > depending on (a) hardware CSO support and (b) single vs multi > fragment packets. I then added the same as the layer where packets > are handed to the outbound nic xmit routine for processing. > > It was only when I could see that with NAT present (and no rules) > that the lower level was seeing none of the partial checksums > and only the fully checksummed packets [only for udp] that I > became suspicious and removed NAT from the mix. > > If you want me to test what you have, I will need the full patchset > for the 138 builds you did so I can add the counters and do the > testing I did to discover this and point out where it was happening. > With these patches, it can be shown via counters that any higher > level packet that tcp or udp generate partial checksums on are > delivered to the hardware without any netfilter/nat routine > disabling it. the full patchset is on http://people.redhat.com/jolsa/bz491549/patch/ it is based on the 138 build [jolsa@jolsa kernel]$ git describe 2.6.18-138.el5 > > > > As it is a quite extensive change, especially the first patch (and the 2nd one > > highly depends on it), I'm thinking about some other way to fix it, so far, > > following hunk from the second patch seems to me like probable fix: > > > > diff --git a/net/ipv4/netfilter/ip_nat_standalone.c > > b/net/ipv4/netfilter/ip_nat_standalone.c > > index 8686b49..163db98 100644 > > --- a/net/ipv4/netfilter/ip_nat_standalone.c > > +++ b/net/ipv4/netfilter/ip_nat_standalone.c > > @@ -116,11 +116,6 @@ ip_nat_fn(unsigned int hooknum, > > if (ct == &ip_conntrack_untracked) > > return NF_ACCEPT; > > > > - /* If we had a hardware checksum before, it's now invalid */ > > - if ((*pskb)->ip_summed == CHECKSUM_HW) > > - if (skb_checksum_help(*pskb, (out == NULL))) > > - return NF_DROP; > > - > > /* Can't track? It's not due to stress, or conntrack would > > have dropped it. Hence it's the user's responsibilty to > > packet filter it out, or implement conntrack/NAT for that > > > > > > thoughts? > > Doing this would fix it if NAT does not touch the packet, but > if NAT touches the packet in any way, it has to do different > things in the handling of the differential checksum depending > on whether or not the packet is partially or fully checksummed. > That is the nature of this fix. The first patchset is seemingly > large, but conceptually it is mostly a transparent change to > properly mark the ip_summed field distinctly in ways that do > not confuse a packet (fragment) that was received from a nic > that computed the checksum -> skb->csum and outbound partially > checksummed packets. You want this because of things like bridges > and I think forwarding as well. So the first patch is really > a good patch because 3 states can not represent what 4 states > can. From the way the 2nd patch is done, the dependency is on > the distinctness of the 4 state patch because the now distinct > CHECKSUM_COMPLETE value means only one thing. It would be possible > to not change the outbound handling of CHECKSUM_HW to CHECKSUM_PARTIAL > only changing inbound handling of it to CHECKSUM_COMPLETE, but that > only means anything if there are 4 distinct values. > > I think for KABI reasons, you folks may not be able to use this > approach to fixing this problem in any RHEL5.x release unless > I am mistaken about your KABI requirements. Once RHEL6 hits the > streets (whenever that is), the basis for it will be a lot > newer than 2.6.18.4 and this problem will be fixed in the base. > I don't expect you have KABI requirements across major releases > (e.g. from RHEL5 to RHEL6). I did some modifications so it passed the KABI check, however I need to discuss this more > > > > > Would you plz share with me more details about your testing, like what > > application you use and where do you catch the bug (printk?), > > so I'd be able to reproduce... > > > > You also mentioned you can share the source code with changes, plz do. > > > I'll dig up the counter code (relative to 128.1.1) but I think > you folks will need to first decide if the basic changes to skbuff.h > make it possible to fix any of this using this approach in RHEL5.x. > Absent the 3 to 4 state changes there, none of the submitted > patches mean anything. I think you could cut the cases down > by ending up with CHECKSUM_HW being equivalent to CHECKSUM_PARTIAL > (as it is now), but you still have to hunt for changes that are > about inbound processing related to CHECKSUM_COMPLETE and it > must be a distinct value [it isn't in the current releases]. I'd appreciate it, at least for my curiosity :) thanks > > > thanks > > jirka (In reply to comment #12) > (In reply to comment #11) > > (In reply to comment #10) > > > I backprop those patches you identified to help you to the RHEL: > > > > > > [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE > > > 84fa7933a33f806bbbaae6775e87459b1ec584c0 > > > > > > [NETFILTER]: Get rid of HW checksum invalidation > > > 4cf411de49c65140b3c259748629b561c0d3340f > > > > > > [TCP]: rare bad TCP checksum with 2.6.19 > > > 52d570aabe921663a987b2e4bae2bdc411cee480 > > > > > > > > > I built and published kernel rpm and the patches on > > > http://people.redhat.com/jolsa/bz491549. (let me know if you need other arch > > > than published). > > > > > > Could you plz try it and see if it works for you.. > > > > Thanks for the builds - I could test that the kernels work, but > > the way I diagnosed this was by adding per-cpu counters and > > a /proc interface to collect them at the udp layer where the > > outbound packets are being checksummed (partially or completely > > depending on (a) hardware CSO support and (b) single vs multi > > fragment packets. I then added the same as the layer where packets > > are handed to the outbound nic xmit routine for processing. > > > > It was only when I could see that with NAT present (and no rules) > > that the lower level was seeing none of the partial checksums > > and only the fully checksummed packets [only for udp] that I > > became suspicious and removed NAT from the mix. > > > > If you want me to test what you have, I will need the full patchset > > for the 138 builds you did so I can add the counters and do the > > testing I did to discover this and point out where it was happening. > > With these patches, it can be shown via counters that any higher > > level packet that tcp or udp generate partial checksums on are > > delivered to the hardware without any netfilter/nat routine > > disabling it. > > the full patchset is on http://people.redhat.com/jolsa/bz491549/patch/ > it is based on the 138 build > > [jolsa@jolsa kernel]$ git describe > 2.6.18-138.el5 If you want me to test relative to 138, then what I am missing is the source rpm for 138. Is this available for testing purposes? > > > > > > As it is a quite extensive change, especially the first patch (and the 2nd one > > > highly depends on it), I'm thinking about some other way to fix it, so far, > > > following hunk from the second patch seems to me like probable fix: > > > > > > diff --git a/net/ipv4/netfilter/ip_nat_standalone.c > > > b/net/ipv4/netfilter/ip_nat_standalone.c > > > index 8686b49..163db98 100644 > > > --- a/net/ipv4/netfilter/ip_nat_standalone.c > > > +++ b/net/ipv4/netfilter/ip_nat_standalone.c > > > @@ -116,11 +116,6 @@ ip_nat_fn(unsigned int hooknum, > > > if (ct == &ip_conntrack_untracked) > > > return NF_ACCEPT; > > > > > > - /* If we had a hardware checksum before, it's now invalid */ > > > - if ((*pskb)->ip_summed == CHECKSUM_HW) > > > - if (skb_checksum_help(*pskb, (out == NULL))) > > > - return NF_DROP; > > > - > > > /* Can't track? It's not due to stress, or conntrack would > > > have dropped it. Hence it's the user's responsibilty to > > > packet filter it out, or implement conntrack/NAT for that > > > > > > > > > thoughts? > > > > Doing this would fix it if NAT does not touch the packet, but > > if NAT touches the packet in any way, it has to do different > > things in the handling of the differential checksum depending > > on whether or not the packet is partially or fully checksummed. > > That is the nature of this fix. The first patchset is seemingly > > large, but conceptually it is mostly a transparent change to > > properly mark the ip_summed field distinctly in ways that do > > not confuse a packet (fragment) that was received from a nic > > that computed the checksum -> skb->csum and outbound partially > > checksummed packets. You want this because of things like bridges > > and I think forwarding as well. So the first patch is really > > a good patch because 3 states can not represent what 4 states > > can. From the way the 2nd patch is done, the dependency is on > > the distinctness of the 4 state patch because the now distinct > > CHECKSUM_COMPLETE value means only one thing. It would be possible > > to not change the outbound handling of CHECKSUM_HW to CHECKSUM_PARTIAL > > only changing inbound handling of it to CHECKSUM_COMPLETE, but that > > only means anything if there are 4 distinct values. > > > > I think for KABI reasons, you folks may not be able to use this > > approach to fixing this problem in any RHEL5.x release unless > > I am mistaken about your KABI requirements. Once RHEL6 hits the > > streets (whenever that is), the basis for it will be a lot > > newer than 2.6.18.4 and this problem will be fixed in the base. > > I don't expect you have KABI requirements across major releases > > (e.g. from RHEL5 to RHEL6). > > I did some modifications so it passed the KABI check, > however I need to discuss this more I am curious how you work around the KABI for 3 vs 4 distinct states since there are drivers that use all 4 of the defines. Granted they are in different contexts, so *maybe* there is a way to differentiate between _HW in the outbound case and _HW in the inbound case (what should be _COMPLETE). The original code did not need this, but relied on certain assumptions that I think may break the forwarding/bridging case. > > > > > > > > > Would you plz share with me more details about your testing, like what > > > application you use and where do you catch the bug (printk?), > > > so I'd be able to reproduce... > > > > > > You also mentioned you can share the source code with changes, plz do. > > > > > > I'll dig up the counter code (relative to 128.1.1) but I think > > you folks will need to first decide if the basic changes to skbuff.h > > make it possible to fix any of this using this approach in RHEL5.x. > > Absent the 3 to 4 state changes there, none of the submitted > > patches mean anything. I think you could cut the cases down > > by ending up with CHECKSUM_HW being equivalent to CHECKSUM_PARTIAL > > (as it is now), but you still have to hunt for changes that are > > about inbound processing related to CHECKSUM_COMPLETE and it > > must be a distinct value [it isn't in the current releases]. > > I'd appreciate it, at least for my curiosity :) thanks I'll prepare an attachment relative to 128.1.1 w/o any of the patches I have already submitted for this. It will take me a little time. > > > > > > thanks > > > jirka I just pulled the 138 base from Don Zickus's http directory so I could see your patches in full. Not changing skb_checksum_help() from 2 to 1 arg fixes part of your KABI needs, but what about the effective change from CHECKSUM_COMPLETE from 1 (aliased to CHECKSUM_HW) to a distinct value of 3? Any drivers that were correctly using this before would be busted w/o a re-compile if used with a kernel that has the changes in skbuff.h in it. Now that I have your patch set, I will generate the udp debugging I added relative to the patched version in 138. There are other bits that are changed in udp.c I think, but my patch will be added on top of your patch set. If the src rpm from Don's tree for 138 is NOT what you used, please let me know, but from the output of folding in your patch set, it seems like it is the same. The changes that you've made look basically fine and I certainly see why you have preserved the nat cheat ip check functions etc. However I think that Patrick McHardy is probably the person who should see what you've changed (not in ./drivers, but in ./net) where it deviates from the patches I submitted relative to rhel5.3 (-128.1.1) because other than the XEN stuff (which I couldn't adequately test), all of the changes were really based on his changes. The only bits I added to the mix would be including all drivers in my search for CHECKSUM_HW conversions and dealing with bits added by RH having to do with XEN. Also - if you are going to use these, I recommend the 4th attachment because it makes the source in netfilter self-consistent. This is separate from how you handled the route me harder bits. (In reply to comment #15) > The changes that you've made look basically fine and I certainly see > why you have preserved the nat cheat ip check functions etc. However > I think that Patrick McHardy is probably the person who should see > what you've changed (not in ./drivers, but in ./net) where it deviates > from the patches I submitted relative to rhel5.3 (-128.1.1) because > other than the XEN stuff (which I couldn't adequately test), all of > the changes were really based on his changes. The only bits I added > to the mix would be including all drivers in my search for CHECKSUM_HW > conversions and dealing with bits added by RH having to do with XEN. > > Also - if you are going to use these, I recommend the 4th attachment > because it makes the source in netfilter self-consistent. This > is separate from how you handled the route me harder bits. In looking more closely, I see what you did with a local version of the route me harder functionality in ipt_REJECT.c. I guess that should work as well, but you are a better judge of that. Sometime in the next week, I'll find some time to try this full set of patches (as written) with debugging added to it to verify that the basic problem of NAT stomping on the checksums is fixed. I didn't have a test case for the udp checksum == 0 case, but code inspection and the fact that the same mongo test on 2.6.19-7 was not failing is sufficient for me to know that part works properly. The more interesting test is the checksum offloading which I can do pretty easily on one of the test machines (Nehalem server). Note that since 138 isn't yet a release (beta or otherwise) if there are bits that are unrelated that stick me, I'll let you know - but am not expecting that to be the case and won't file a new bug against it since I can see that 139 is already there. (In reply to comment #14) > I just pulled the 138 base from Don Zickus's http directory so I could see your > patches in full. Not changing skb_checksum_help() from 2 to 1 arg fixes part of > your KABI needs, but what about the effective change from CHECKSUM_COMPLETE > from 1 > (aliased to CHECKSUM_HW) to a distinct value of 3? Any drivers that were > correctly using this before would be busted w/o a re-compile if used with > a kernel that has the changes in skbuff.h in it. > You're right, I discussed with Neil Horman and this actually looks as a big problem. I did not realized that even this change passed the automated KABI check, it actually breaks the KABI. My current understanding is that either you'd need to wait for another release which will have this one fixed from upstream, or we need to find another solution. > Now that I have your patch set, I will generate the udp debugging I added > relative to the patched version in 138. There are other bits that are > changed in udp.c I think, but my patch will be added on top of your > patch set. > > If the src rpm from Don's tree for 138 is NOT what you used, please > let me know, but from the output of folding in your patch set, it seems > like it is the same. you're right I used 138. I'm trying to reproduce the bug. Plz check attachments and find the send application, which prepare/send the UDP packet with 0xFFFF checksum and the kernel patch (138 based) that printks any 0 checksum occurance. Plz let me know if I miss something.. Created attachment 340300 [details]
generates/sends a udp packet with checksum 0xFFFF
Makefile:
---
CC=gcc
send: send.o
Created attachment 340301 [details]
patch to detect the zero checksum packet
based on kernel-2.6.18-138.el5
(In reply to comment #17) > (In reply to comment #14) > > I just pulled the 138 base from Don Zickus's http directory so I could see your > > patches in full. Not changing skb_checksum_help() from 2 to 1 arg fixes part of > > your KABI needs, but what about the effective change from CHECKSUM_COMPLETE > > from 1 > > (aliased to CHECKSUM_HW) to a distinct value of 3? Any drivers that were > > correctly using this before would be busted w/o a re-compile if used with > > a kernel that has the changes in skbuff.h in it. > > > > You're right, I discussed with Neil Horman and this actually looks as a big > problem. I did not realized that even this change passed the automated KABI > check, it actually breaks the KABI. > > My current understanding is that either you'd need to wait for another > release which will have this one fixed from upstream, or we need to > find another solution. We can brew our own kernels in any case (including drivers). For other people who do not use the nat module, this isn't a problem at all. There ought to be ways to be able to tell the "original" direction of a packet and based on that know the distinction between the 2 values. One way to do that is tag the skb's that come upstream (from a driver). I don't know if there any unused bitfields in the skb, but I think there are. Since all skb's come from the kernel at least for non-Xen (maybe Xen too - I didn't count the unused bits) this could be possible (based on the path of the packet). But that's a hack - maybe a worthwhile hack. All upstream packets have to be recognizable. i.e. if it tagged by the driver as hw/partial it can be converted on the fly to complete (distinct value). Then the KABI issue should go away provided all the inbound paths are covered. <snip> Your udp checksum 0 program is handy. Thank you. As mentioned before I will generate a 138 based kernel and brew the full counters that show before/after patching behavior. Your test for the checksum 0 case is one piece of it and hooking it into dst_output() is a good place to do it. I actually put my low level counter at the nic layer (igb) so I could really see if it ever received any partial checksum packets. It should be noted that tiny tcp packets (e.g syn,fin,ack-only,rst) are checksummed by the host, so you have to take this into account. But by counting the number of none or hw/partial packets (outbound) it was clear what is happening. <snip>
I was able to recreate the problem with the 'send' tool and the additional kernel patch (attached test-code.patch) -> I saw ZERO checksum packets in the proc file and also in the tcpdump from the UDP peer server.
I also verified the fix we have so far is working properly - no ZERO checksum packets in the proc file and the proper tcpdump
<snip>
> There ought to be ways to be able to tell the "original" direction
> of a packet and based on that know the distinction between the
> 2 values. One way to do that is tag the skb's that come upstream
> (from a driver). I don't know if there any unused bitfields in the
> skb, but I think there are. Since all skb's come from the kernel
> at least for non-Xen (maybe Xen too - I didn't count the unused bits)
> this could be possible (based on the path of the packet). But that's
> a hack - maybe a worthwhile hack. All upstream packets have to
> be recognizable. i.e. if it tagged by the driver as hw/partial
> it can be converted on the fly to complete (distinct value). Then
> the KABI issue should go away provided all the inbound paths are
> covered.
I'll look more closely on this possibility, but I'm guessing that we will
probably end up with the patch of the same magnitude
Created attachment 340644 [details]
patch to detect the zero checksum packet and display counts in proc file
based on kernel-2.6.18-139.el5
(In reply to comment #21) > <snip> > > I was able to recreate the problem with the 'send' tool and the additional > kernel patch (attached test-code.patch) -> I saw ZERO checksum packets in the > proc file and also in the tcpdump from the UDP peer server. > > I also verified the fix we have so far is working properly - no ZERO checksum > packets in the proc file and the proper tcpdump > > > <snip> > > > There ought to be ways to be able to tell the "original" direction > > of a packet and based on that know the distinction between the > > 2 values. One way to do that is tag the skb's that come upstream > > (from a driver). I don't know if there any unused bitfields in the > > skb, but I think there are. Since all skb's come from the kernel > > at least for non-Xen (maybe Xen too - I didn't count the unused bits) > > this could be possible (based on the path of the packet). But that's > > a hack - maybe a worthwhile hack. All upstream packets have to > > be recognizable. i.e. if it tagged by the driver as hw/partial > > it can be converted on the fly to complete (distinct value). Then > > the KABI issue should go away provided all the inbound paths are > > covered. > > I'll look more closely on this possibility, but I'm guessing that we will > probably end up with the patch of the same magnitude I discussed with Neil Horman and it seems there's no easy way to tell the packet direction out of the sk_buff: <email> > Those flags are used for telling the packet's direction and card's capability > to compute the HW checksum. > > Is there any other way we could get the packet direction out of the sk_buff? > If there is, we could probably bypass the first patch and stay KABI compliant. > > thoughts? > Nothing in the skbuff thats definitive, no, unfortunately. Neil </email> So I guess the best would be to push this bug out to the RHEL 6. (In reply to comment #23) > (In reply to comment #21) > > <snip> > > > > I was able to recreate the problem with the 'send' tool and the additional > > kernel patch (attached test-code.patch) -> I saw ZERO checksum packets in the > > proc file and also in the tcpdump from the UDP peer server. > > > > I also verified the fix we have so far is working properly - no ZERO checksum > > packets in the proc file and the proper tcpdump > > > > > > <snip> > > > > > There ought to be ways to be able to tell the "original" direction > > > of a packet and based on that know the distinction between the > > > 2 values. One way to do that is tag the skb's that come upstream > > > (from a driver). I don't know if there any unused bitfields in the > > > skb, but I think there are. Since all skb's come from the kernel > > > at least for non-Xen (maybe Xen too - I didn't count the unused bits) > > > this could be possible (based on the path of the packet). But that's > > > a hack - maybe a worthwhile hack. All upstream packets have to > > > be recognizable. i.e. if it tagged by the driver as hw/partial > > > it can be converted on the fly to complete (distinct value). Then > > > the KABI issue should go away provided all the inbound paths are > > > covered. > > > > I'll look more closely on this possibility, but I'm guessing that we will > > probably end up with the patch of the same magnitude > > > I discussed with Neil Horman and it seems there's no easy way to tell > the packet direction out of the sk_buff: > > <email> > > Those flags are used for telling the packet's direction and card's capability > > to compute the HW checksum. > > > > Is there any other way we could get the packet direction out of the sk_buff? > > If there is, we could probably bypass the first patch and stay KABI compliant. > > > > thoughts? > > > Nothing in the skbuff thats definitive, no, unfortunately. > Neil > </email> > > So I guess the best would be to push this bug out to the RHEL 6. What I was thinking about is the following: A network device receives a hardware packet. If it marks it with CHECKSUM_NONE or CHECKSUM_UNNECESSARY nothing needs to change. Only if it uses the ABI value for HW (PARTIAL) does anything need to happen. Since the packets have to ultimately be handed to netif_receive_skb() [or variants of that] those functions could transform CHECKSUM_PARTIAL => CHECKSUM_COMPLETE. Since netif_receive_skb() is in the kernel [along with variants of it for LRO and vlan], it could handle this transformation. If that is feasible, then drivers compiled with the wrong value of CHECKSUM_COMPLETE [aliased to CHECKSUM_HW => CHECKSUM_PARTIAL] would work w/o breaking the KABI. proposing to close as wontfix due to KABI issues (In reply to comment #26) > proposing to close as wontfix due to KABI issues I expected this to be final resolution - when RHEL6 comes out I expect the base 2.6 release to be changed to something newer than 2.6.18 and in that case this should be fixed w/o doing anything. |