To avoid a zero RTT in tcp_vegas_rtt_calc (upstream: tcp_vegas_pkts_acked), we simply add 1 to the unsigned RTT sample. That addition wraps to 0 when the sample is MAX_U32, which can be caused by a really rare situation or by a bug in other code. Upstream avoids this by a) using a signed value and b) returning early with if (RTT_SAMPLE < 0) return;, so the zero case does not occur there.
The codepath is:
1) When the tcp_vegas congestion control is in use, the rtt_sample hook is set to tcp_vegas_rtt_calc, which is called from tcp_clean_rtx_queue with the socket and the measured RTT (a time difference relative to now).
2) In the tcp_vegas_rtt_calc we have:
static void tcp_vegas_rtt_calc(struct sock *sk, u32 usrtt)
{
	struct vegas *vegas = inet_csk_ca(sk);
	u32 vrtt = usrtt + 1; /* Never allow zero rtt or baseRTT */

	/* Filter to find propagation delay: */
	if (vrtt < vegas->baseRTT)
		vegas->baseRTT = vrtt;

	/* Find the min RTT during the last RTT to find
	 * the current prop. delay + queuing delay:
	 */
	vegas->minRTT = min(vegas->minRTT, vrtt);
}
So if we receive usrtt == MAX_U32, vrtt wraps around to 0 and we end up with minRTT == 0.
3) When cong_avoid (tcp_vegas_cong_avoid) is called, we have:
	rtt = vegas->minRTT;
	...
	target_cwnd = ((old_wnd * vegas->baseRTT)
		       << V_PARAM_SHIFT) / rtt;
With rtt == 0 this is a division by zero.
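The codepath above can be reproduced in user space. The following is only a sketch, not kernel code: uint32_t stands in for u32, the helper names vegas_vrtt and vegas_target_cwnd are hypothetical, and the shift value of 1 is an assumption made for illustration. The second helper reports the zero divisor instead of dividing.

```c
#include <stdint.h>

#define V_PARAM_SHIFT_SKETCH 1	/* assumed value, for illustration only */

/* Step 2: "usrtt + 1" on an unsigned 32-bit value wraps modulo 2^32,
 * so UINT32_MAX + 1 yields 0. */
static uint32_t vegas_vrtt(uint32_t usrtt)
{
	return usrtt + 1;
}

/* Step 3: the target_cwnd expression, with an explicit zero check
 * where the quoted kernel code divides unconditionally.
 * Returns 0 on success, -1 if rtt is zero (*target_cwnd is untouched). */
static int vegas_target_cwnd(uint32_t old_wnd, uint32_t base_rtt,
			     uint32_t rtt, uint32_t *target_cwnd)
{
	if (rtt == 0)
		return -1;

	*target_cwnd = ((old_wnd * base_rtt) << V_PARAM_SHIFT_SKETCH) / rtt;
	return 0;
}
```

Feeding vegas_vrtt(UINT32_MAX) into vegas_target_cwnd() as rtt hits the rtt == 0 branch, which is the point where the real code traps.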
The customer confirmed that with the following patch in tcp_vegas_rtt_calc() the problem does not occur:
	if (vrtt == 0)
		vrtt = 1;
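For reference, here is a user-space sketch of how that clamp behaves. It is an illustration under assumptions, not the actual patch: struct vegas_sketch is a hypothetical stand-in holding only the two fields used here, and uint32_t replaces u32.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two struct vegas fields used below. */
struct vegas_sketch {
	uint32_t baseRTT;
	uint32_t minRTT;
};

static void vegas_rtt_calc_clamped(struct vegas_sketch *vegas, uint32_t usrtt)
{
	uint32_t vrtt = usrtt + 1;

	if (vrtt == 0)		/* usrtt was UINT32_MAX and the addition wrapped */
		vrtt = 1;

	/* Filter to find propagation delay */
	if (vrtt < vegas->baseRTT)
		vegas->baseRTT = vrtt;

	/* Track the minimum RTT seen */
	if (vrtt < vegas->minRTT)
		vegas->minRTT = vrtt;
}
```

Note that the clamp only removes the zero divisor: a wrapped sample is still accepted as an RTT of 1 and therefore pulls both baseRTT and minRTT down to 1.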
Is there anything else I can do?
Created attachment 437620
Upstream is quite heavily modified (including the logic of the caller of tcp_vegas_rtt_calc), so I've tried to take only the parts that affect us and this bug. Upstream, if the RTT time difference is <= 0, the function just returns without any warning, treating the sample as a bogus value.
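The upstream-style approach can be sketched in user space as follows. This is not the attached patch: the struct and function names are hypothetical, int32_t and uint32_t stand in for the kernel's s32 and u32, and the check uses the < 0 form quoted at the top of this report.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two struct vegas fields used below. */
struct vegas_sample_sketch {
	uint32_t baseRTT;
	uint32_t minRTT;
};

/* Returns 1 if the sample was used, 0 if it was discarded as bogus. */
static int vegas_take_sample(struct vegas_sample_sketch *vegas, int32_t rtt_us)
{
	uint32_t vrtt;

	if (rtt_us < 0)
		return 0;	/* silently drop the bogus sample */

	/* rtt_us is in [0, INT32_MAX], so this is at most 2^31 and never 0 */
	vrtt = (uint32_t)rtt_us + 1;

	if (vrtt < vegas->baseRTT)
		vegas->baseRTT = vrtt;
	if (vrtt < vegas->minRTT)
		vegas->minRTT = vrtt;
	return 1;
}
```

With a signed sample, the bit pattern that used to arrive as MAX_U32 is seen as -1 and rejected before the addition, so neither baseRTT nor minRTT is affected by it.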
Please review and say if testing is needed.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
You can download this test kernel from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
Following the steps in comment 9, got a panic on the -194 kernel:
Code: f7 75 10 8d 14 09 8b 8b d4 04 00 00 29 c2 3b 8b d0 04 00 00
RIP [<ffffffff8855419d>] :tcp_vegas:tcp_vegas_cong_avoid+0x82/0x14d
<0>Kernel panic - not syncing: Fatal exception
Code: 66 83 7f 02 02 77 18 55 89 d8 51 8b 4c 24 08 8b 54 24 0c e8 6b 6f b9 c7 59 5b e9 c1 00 00 00 8d 0c 36 31 d2 0f af 77 08 8d 04 36 <f7> 77 04 29 c1 89 4f 10 8b 93 6c 03 00 00 3b 93 68 03 00 00 77
EIP: [<f8a5e148>] tcp_veno_cong_avoid+0xac/0x16f [tcp_veno] SS:ESP 0068:c074adcc
<0>Kernel panic - not syncing: Fatal exception in interrupt
Confirmed there was no panic on the -230 kernel.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.