Bug 608641

Summary: vegas and veno possible division by zero bug
Product: Red Hat Enterprise Linux 5 Reporter: Veaceslav Falico <vfalico>
Component: kernelAssignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact: Eryu Guan <eguan>
Severity: high Docs Contact:
Priority: urgent    
Version: 5.3CC: eguan, nhorman, peterm, rbinkhor, tao
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-01-13 21:39:46 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
proposed patch none

Description Veaceslav Falico 2010-06-28 11:21:09 UTC
When trying to avoid the zero-case in tcp_vegas_rtt_calc (upstream tcp_vegas_pkts_acked ) of an unsigned RTT sample, we just add 1 to the unsigned value, which can lead to 0-case when we have the MAX_U32 RTT sample, caused by a really rare situation or a bug in other code. In the upstream it's fixed with a) signed value b) if (RTT_SAMPLE < 0) return;, so it doesn't occur.

The codepath is:

1) When using the tcp_vegas congestion control, we have the rtt_sample set to tcp_vegas_rtt_calc function, and it's called in tcp_clean_rtx_queue with the socket and RTT difference between now and sockets.

2) In the tcp_vegas_rtt_calc we have:

123 static void tcp_vegas_rtt_calc(struct sock *sk, u32 usrtt)
124 {   
125     struct vegas *vegas = inet_csk_ca(sk);
126     u32 vrtt = usrtt + 1; /* Never allow zero rtt or baseRTT */
127 
128     /* Filter to find propagation delay: */
129     if (vrtt < vegas->baseRTT)
130         vegas->baseRTT = vrtt;
131 
132     /* Find the min RTT during the last RTT to find
133      * the current prop. delay + queuing delay:
134      */
135     vegas->minRTT = min(vegas->minRTT, vrtt);

So if we receive the usrtt == MAX_U32, then we have minRTT == 0.

3) When the cong_avoid (tcp_vegas_cong_avoid) is called, we have:

245             rtt = vegas->minRTT;
<snip comments>
255             target_cwnd = ((old_wnd * vegas->baseRTT)
256                        << V_PARAM_SHIFT) / rtt;

So that we get a division by zero.

Comment 2 Veaceslav Falico 2010-08-02 10:47:47 UTC
Hello,

The customer confirmed that with the 

if (vrtt == 0)
    vrtt = 1;

patch in tcp_vegas_rtt_calc() the problem does not occur.

Is there anything else I can do?

Thank you!

Comment 3 Veaceslav Falico 2010-08-09 15:17:05 UTC
Created attachment 437620 [details]
proposed patch

Hello,

The upstream is quite heavily modified (and the logic of the caller of tcp_vegas_rtt_calc also), so that I've tried to take only the parts that affect us and our bug. In the upstream if we see that the time diff of rtt is <=0, we just return without any warnings, considering it to be just a bogus value.

Please review and say if testing is needed.

Thank you!

Comment 5 Neil Horman 2010-08-11 19:38:03 UTC
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2676539
test build

Comment 6 RHEL Product and Program Management 2010-08-12 18:19:22 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 12 Jarod Wilson 2010-08-31 01:17:49 UTC
in kernel-2.6.18-214.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 14 Eryu Guan 2010-11-09 06:32:18 UTC
Following steps in comment 9, got panic on -194 kernel 
Code: f7 75 10 8d 14 09 8b 8b d4 04 00 00 29 c2 3b 8b d0 04 00 00 
RIP  [<ffffffff8855419d>] :tcp_vegas:tcp_vegas_cong_avoid+0x82/0x14d
 RSP <ffff810001757b00>
 hit
<0>Kernel panic - not syncing: Fatal exception

Code: 66 83 7f 02 02 77 18 55 89 d8 51 8b 4c 24 08 8b 54 24 0c e8 6b 6f b9 c7 59 5b e9 c1 00 00 00 8d 0c 36 31 d2 0f af 77 08 8d 04 36 <f7> 77 04 29 c1 89 4f 10 8b 93 6c 03 00 00 3b 93 68 03 00 00 77
EIP: [<f8a5e148>] tcp_veno_cong_avoid+0xac/0x16f [tcp_veno] SS:ESP 0068:c074adcc
 <0>Kernel panic - not syncing: Fatal exception in interrupt

Confirmed there was no panic on -230 kernel.

Comment 16 errata-xmlrpc 2011-01-13 21:39:46 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html