Bug 608641

Summary:

vegas and veno possible division by zero bug

Product:

Red Hat Enterprise Linux 5

Reporter:

Veaceslav Falico <vfalico>

Component:

kernel

Assignee:

Neil Horman <nhorman>

Status:

CLOSED ERRATA

QA Contact:

Eryu Guan <eguan>

Severity:

high

Docs Contact:

Priority:

urgent

Version:

5.3

CC:

eguan, nhorman, peterm, rbinkhor, tao

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2011-01-13 21:39:46 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
proposed patch	none

Description Veaceslav Falico 2010-06-28 11:21:09 UTC

When trying to avoid the zero-case in tcp_vegas_rtt_calc (upstream tcp_vegas_pkts_acked ) of an unsigned RTT sample, we just add 1 to the unsigned value, which can lead to 0-case when we have the MAX_U32 RTT sample, caused by a really rare situation or a bug in other code. In the upstream it's fixed with a) signed value b) if (RTT_SAMPLE < 0) return;, so it doesn't occur.

The codepath is:

1) When using the tcp_vegas congestion control, we have the rtt_sample set to tcp_vegas_rtt_calc function, and it's called in tcp_clean_rtx_queue with the socket and RTT difference between now and sockets.

2) In the tcp_vegas_rtt_calc we have:

123 static void tcp_vegas_rtt_calc(struct sock *sk, u32 usrtt)
124 {   
125     struct vegas *vegas = inet_csk_ca(sk);
126     u32 vrtt = usrtt + 1; /* Never allow zero rtt or baseRTT */
127 
128     /* Filter to find propagation delay: */
129     if (vrtt < vegas->baseRTT)
130         vegas->baseRTT = vrtt;
131 
132     /* Find the min RTT during the last RTT to find
133      * the current prop. delay + queuing delay:
134      */
135     vegas->minRTT = min(vegas->minRTT, vrtt);

So if we receive the usrtt == MAX_U32, then we have minRTT == 0.

3) When the cong_avoid (tcp_vegas_cong_avoid) is called, we have:

245             rtt = vegas->minRTT;
<snip comments>
255             target_cwnd = ((old_wnd * vegas->baseRTT)
256                        << V_PARAM_SHIFT) / rtt;

So that we get a division by zero.

Comment 2 Veaceslav Falico 2010-08-02 10:47:47 UTC

Hello,

The customer confirmed that with the 

if (vrtt == 0)
    vrtt = 1;

patch in tcp_vegas_rtt_calc() the problem does not occur.

Is there anything else I can do?

Thank you!

Comment 3 Veaceslav Falico 2010-08-09 15:17:05 UTC

Created attachment 437620 [details]
proposed patch

Hello,

The upstream is quite heavily modified (and the logic of the caller of tcp_vegas_rtt_calc also), so that I've tried to take only the parts that affect us and our bug. In the upstream if we see that the time diff of rtt is <=0, we just return without any warnings, considering it to be just a bogus value.

Please review and say if testing is needed.

Thank you!

Comment 5 Neil Horman 2010-08-11 19:38:03 UTC

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2676539
test build

Comment 6 RHEL Program Management 2010-08-12 18:19:22 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 12 Jarod Wilson 2010-08-31 01:17:49 UTC

in kernel-2.6.18-214.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 14 Eryu Guan 2010-11-09 06:32:18 UTC

Following steps in comment 9, got panic on -194 kernel 
Code: f7 75 10 8d 14 09 8b 8b d4 04 00 00 29 c2 3b 8b d0 04 00 00 
RIP  [<ffffffff8855419d>] :tcp_vegas:tcp_vegas_cong_avoid+0x82/0x14d
 RSP <ffff810001757b00>
 hit
<0>Kernel panic - not syncing: Fatal exception

Code: 66 83 7f 02 02 77 18 55 89 d8 51 8b 4c 24 08 8b 54 24 0c e8 6b 6f b9 c7 59 5b e9 c1 00 00 00 8d 0c 36 31 d2 0f af 77 08 8d 04 36 <f7> 77 04 29 c1 89 4f 10 8b 93 6c 03 00 00 3b 93 68 03 00 00 77
EIP: [<f8a5e148>] tcp_veno_cong_avoid+0xac/0x16f [tcp_veno] SS:ESP 0068:c074adcc
 <0>Kernel panic - not syncing: Fatal exception in interrupt

Confirmed there was no panic on -230 kernel.

Comment 16 errata-xmlrpc 2011-01-13 21:39:46 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html