Bug 49910
Summary: | Linux 2.4.x SMP kernel loops in tcp_twkill__thr under heavy network load
---|---
Product: | [Retired] Red Hat Linux
Reporter: | rlu
Component: | kernel
Assignee: | David Miller <davem>
Status: | CLOSED WORKSFORME
QA Contact: | Brock Organ <borgan>
Severity: | high
Priority: | high
Version: | 7.1
CC: | rlu
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2001-08-30 18:27:24 UTC
Description
rlu
2001-07-25 00:17:00 UTC
Created attachment 24867: Fix to the bug
First, 2.4.0-7 kernel? What is that? :-)

Second, every report I've seen like this has the person running TUX; the lockup never occurs with people not using TUX (even if TUX is compiled in and available). This makes it look like it is possibly a TUX bug, and I'd therefore like Ingo to look at this first.

By 2.4.0-7, I meant kernel versions 2.4.0 through 2.4.7. I am not running TUX at all; I was just running a multi-threaded, network-I/O-intensive application.

Can you possibly attach the source for this test program? If I can reproduce it here I can better evaluate your fix.

Created attachment 25711: Proposed fix for timewait races.
The fix proposed by davem will not fix the problem. As I described in the first bug report, if there is a race between the hashdance and tcp_tw_schedule(), the tw will get recycled to the kcache by the additional tcp_tw_put(tw), but the now-invalid tw is still in the timewait list due to the tcp_tw_schedule(). Sorry, I cannot get you the application I am using to reproduce the problem. It is rather difficult to write another program to reproduce such a subtle race condition.

Rongqing

You put me in an interesting situation by saying that you can produce an OOPS yet you cannot even provide me with the test case that triggers it. I am very certain that my patch does in fact fix the problem. We grab an extra reference, so the access to the timewait bucket during tcp_tw_schedule() is ALWAYS VALID. We are guaranteed to still hold a refcount of ONE when that function is called; only at the tcp_tw_put() added by my patch can the timewait bucket be freed. Did you actually test a kernel with my patch applied, or did you just look at the patch and say "that won't fix it" without even testing it out?

You are right: tcp_tw_deschedule() will not decrease the refcount if the tw is not in the timewait list yet, so the refcount will be TWO after tcp_tw_schedule(). This makes sure the tw can only be freed by either a later tcp_tw_deschedule() or the tcp_tw_put() you added.
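
The reference-count argument in the last two comments can be walked through with a small user-space model. This is only a sketch under stated assumptions, not the kernel code or the attached patch: `struct tw_bucket`, `racing_kill()` and `simulate()` are hypothetical names, the hash tables are collapsed into a single reference, and the race is replayed sequentially instead of on two CPUs. It assumes, as the thread describes, that scheduling takes a reference for the timewait list and that descheduling drops one only if the bucket is already on that list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a timewait bucket (not the kernel struct). */
struct tw_bucket {
    int  refcnt;   /* number of live references */
    bool in_list;  /* linked on the timewait list? */
    bool freed;    /* set instead of really freeing, so it can be inspected */
};

static void tw_hold(struct tw_bucket *tw) { tw->refcnt++; }

static void tw_put(struct tw_bucket *tw)
{
    assert(tw->refcnt > 0);
    if (--tw->refcnt == 0)
        tw->freed = true;  /* the kernel would recycle it to the slab cache */
}

/* Models tcp_tw_schedule(): the list takes its own reference.
 * Returns false if it would have touched an already freed bucket. */
static bool tw_schedule(struct tw_bucket *tw)
{
    if (tw->freed)
        return false;
    if (!tw->in_list) {
        tw->in_list = true;
        tw_hold(tw);
    }
    return true;
}

/* Models tcp_tw_deschedule(): drops the list reference only if linked. */
static void tw_deschedule(struct tw_bucket *tw)
{
    if (tw->in_list) {
        tw->in_list = false;
        tw_put(tw);
    }
}

/* Another CPU kills the bucket between the hashdance and the schedule. */
static void racing_kill(struct tw_bucket *tw)
{
    tw_deschedule(tw);  /* no-op here: the bucket is not on the list yet */
    tw_put(tw);         /* drops the (single, modelled) hash reference */
}

/* Replays the race with or without the extra reference; reports the
 * refcount seen right after the schedule step. */
static bool simulate(bool extra_ref, int *refcnt_after_schedule)
{
    struct tw_bucket tw = { .refcnt = 1, .in_list = false, .freed = false };

    if (extra_ref)
        tw_hold(&tw);             /* extra reference taken up front */
    racing_kill(&tw);
    bool valid = tw_schedule(&tw);
    *refcnt_after_schedule = tw.refcnt;
    if (valid && extra_ref)
        tw_put(&tw);              /* the put added alongside the extra hold */
    return valid;
}
```

Calling `simulate(true, &n)` leaves the bucket valid at the schedule step with a count of two afterwards, matching the final comment; `simulate(false, &n)` reports that the schedule step would have touched a freed bucket.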