| Summary: | RHEL6.4: kernel 2.6.32-358.el6 panic, divide error with RIP bictcp_cong_avoid called from tcp_ack in TCP stack | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Marcelo Ricardo Leitner <mleitner> |
| Component: | kernel | Assignee: | Jesper Brouer <jbrouer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 6.4 | CC: | ivecera, jbrouer, kdube, nchavan, plakhera, rkhan, todayyang |
| Target Milestone: | rc | ||
| Target Release: | 6.6 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | kernel-2.6.32-386.el6 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-12-11 08:43:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Marcelo Ricardo Leitner
2013-08-21 18:30:22 UTC
Hi Prashant, If that's easy to reproduce, please check with them if disabling GRO works around this issue. Thanks. When looking into the upstream code changes in this area, I found some commits that looked interesting, only to find that these commits had already been backported in bug#920794 to RHEL6.5. These fixes are avail in kernel-2.6.32-375.el6. I request that the customer get this kernel version to try on his system --Jesper (In reply to Jesper Brouer from comment #7) > When looking into the upstream code changes in this area, I found some > commits that looked interesting, only to find that these commits had already > been backported in bug#920794 to RHEL6.5. > > These fixes are avail in kernel-2.6.32-375.el6. > > I request that the customer get this kernel version to try on his system > --Jesper Jesper, the test has been already requested. Customer seems fine with testing but we have to ship that package to them (in progress). But uh, could you share the commit list you thought relevant in there? I'd love to know where you're targeting at :) Thanks, Marcelo (In reply to Marcelo Ricardo Leitner from comment #8) > But uh, could you share the commit list you thought relevant in there? I'd > love to know where you're targeting at :) It was mostly a hunch to a fix of a div by zero due, caused by snd_cwnd being 0, which is fixed in: https://bugzilla.redhat.com/show_bug.cgi?id=920794#c8 Related patch: http://patchwork.lab.bos.redhat.com/patch/57483 Related patch: http://patchwork.lab.bos.redhat.com/patch/57490 But it does not correlate with your crash dump analysis... so I might be wrong... (In reply to Marcelo Ricardo Leitner from comment #0) > > After the commit in BZ 832203, I can't see how ca->delayed_ack can reach 0. Yes, I was also under that assumption, but looking closer at the upstream commit b9f47a3aae (tcp_cubic: limit delayed_ack ratio to prevent divide error), which states it fixes div-by-zero, I think it contains an error... commit b9f47a3aaeabdce3b42829bbb27765fa340f76ba Author: stephen hemminger <shemminger> Date: Wed May 4 10:04:56 2011 +0000 tcp_cubic: limit delayed_ack ratio to prevent divide error TCP Cubic keeps a metric that estimates the amount of delayed acknowledgements to use in adjusting the window. If an abnormally large number of packets are acknowledged at once, then the update could wrap and reach zero. This kind of ACK could only happen when there was a large window and huge number of ACK's were lost. This patch limits the value of delayed ack ratio. The choice of 32 is just a conservative value since normally it should be range of 1 to 4 packets. Signed-off-by: Stephen Hemminger <shemminger> Signed-off-by: David S. Miller <davem> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 34340c9..f376b05 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -93,6 +93,7 @@ struct bictcp { u32 ack_cnt; /* number of acks */ u32 tcp_cwnd; /* estimated tcp cwnd */ #define ACK_RATIO_SHIFT 4 +#define ACK_RATIO_LIMIT (32u << ACK_RATIO_SHIFT) u16 delayed_ack; /* estimate the ratio of Packets/ACKs << 4 */ u8 sample_cnt; /* number of samples to decide curr_rtt */ u8 found; /* the exit point is found? */ @@ -398,8 +399,12 @@ static void bictcp_acked(struct sock *sk, u32 cnt, s32 rtt_us) u32 delay; if (icsk->icsk_ca_state == TCP_CA_Open) { - cnt -= ca->delayed_ack >> ACK_RATIO_SHIFT; - ca->delayed_ack += cnt; + u32 ratio = ca->delayed_ack; + + ratio -= ca->delayed_ack >> ACK_RATIO_SHIFT; + ratio += cnt; + + ca->delayed_ack = min(ratio, ACK_RATIO_LIMIT); } It seems wrong to use min(), to avoid ca->delayed_ack reaching zero, shouldn't it be max(). Or perhaps I'm misunderstanding the code, as I'm a bit uncertain about what this code path achieves... Thanks for sharing that, Jesper. That's exactly why I asked for it: that code is way not clear for me too. Thanks! In Customer Case 00893760 (closed); it was verified/confirmed that kernel-2.6.32-386.el6 solved their crash. Closing ths BZ (In reply to Marcelo Ricardo Leitner from comment #11) > Thanks for sharing that, Jesper. That's exactly why I asked for it: that > code is way not clear for me too. Thanks! this issue happened when apply this patch. that code is way not clear for me too that avoid the fixes div-by-zero (In reply to todayyang from comment #25) > (In reply to Marcelo Ricardo Leitner from comment #11) > > Thanks for sharing that, Jesper. That's exactly why I asked for it: that > > code is way not clear for me too. Thanks! > > this issue happened when apply this patch. > > that code is way not clear for me too that avoid the fixes div-by-zero Sorry, I don't understand what you mean here. You are having this issue (div-by-zero) after applying this patch? |