Bug 49910

Summary: Linux 2.4.x SMP kernel loops in tcp_twkill__thr under heavy network load
Product: [Retired] Red Hat Linux
Reporter: rlu
Component: kernel
Assignee: David Miller <davem>
Status: CLOSED WORKSFORME
QA Contact: Brock Organ <borgan>
Severity: high
Priority: high
Version: 7.1
CC: rlu
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2001-08-30 18:27:24 UTC

Attachments (description, flags):
Fix to the bug (flags: none)
Proposed fix for timewait races. (flags: none)

Description rlu 2001-07-25 00:17:00 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

Description of problem:
The Linux 2.4.0-7 SMP kernel always hangs under heavy load. On a 2-CPU 
system, one CPU is always at 100% sys when it happens. Printing the PC 
shows that the CPU is looping in tcp_twkill__thr.

How reproducible:
Always

Steps to Reproduce:
1. Put a BUG_TRAP in the function tcp_twkill in net/ipv4/tcp_minisocks.c.
        while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
                /* added to show the problem */
                BUG_TRAP(tw->next_death != tw);
                tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
                tw->pprev_death = NULL;
                /* added to prevent the system hang, just to show the
                 * problem; it is not a fix */
                tw->next_death = NULL;
                spin_unlock(&tw_death_lock);

                tcp_timewait_kill(tw);
                tcp_tw_put(tw);

2. Compile the kernel and run the system under heavy TCP load.

	

Actual Results:  /var/log/messages contains the following lines after the 
system has run under heavy network load.

Jul 17 23:01:03 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 01:59:26 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 04:14:41 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 04:30:57 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 04:49:40 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 06:26:58 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 06:32:26 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 06:37:10 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 07:55:18 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 10:12:49 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr



Expected Results:  The BUG_TRAP should never fire.

Additional info:

The actual problem is a race between tcp_time_wait() and 
tcp_v4_check_established(). Under heavy load, after CPU0 has called 
__tcp_tw_hashdance(tw) but before it calls tcp_tw_schedule(tw, timeo), 
CPU1 can call tcp_v4_check_established(), which runs tcp_tw_deschedule(), 
tcp_timewait_kill() and tcp_tw_put(tw), releasing the tw back into the 
kmem_cache. CPU0 then still calls tcp_tw_schedule(tw) although the tw is 
already invalid. The tw can get reused and inserted into the same 
timewait bucket, creating a cycle in the linked list.
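
The list corruption described above can be sketched in user space. This is 
only an illustration, not the kernel code: struct tw_sketch, tw_link(), 
tw_list_has_cycle() and tw_double_link_demo() are hypothetical names, and 
the sketch assumes the recycled object is linked a second time without 
being unlinked first, because its link fields were reset when it was 
reallocated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the timewait bucket: only the death-row links. */
struct tw_sketch {
    struct tw_sketch *next_death;
    struct tw_sketch **pprev_death;
};

/* Head insertion into a next/pprev style list (assumed shape). */
static void tw_link(struct tw_sketch **head, struct tw_sketch *tw)
{
    tw->next_death = *head;
    if (tw->next_death != NULL)
        tw->next_death->pprev_death = &tw->next_death;
    *head = tw;
    tw->pprev_death = head;
}

/* Walk at most `limit` nodes; return 1 if the walk did not reach NULL. */
static int tw_list_has_cycle(const struct tw_sketch *head, int limit)
{
    while (head != NULL && limit-- > 0)
        head = head->next_death;
    return head != NULL;
}

/*
 * Link the same object twice into one slot: once by the stale schedule
 * call on the freed object, once after the allocator hands the same
 * memory out again. Returns 1 if the node ends up pointing at itself.
 */
static int tw_double_link_demo(void)
{
    struct tw_sketch *slot = NULL;
    struct tw_sketch tw = { NULL, NULL };

    tw_link(&slot, &tw);    /* stale schedule on the freed object */
    tw_link(&slot, &tw);    /* recycled object scheduled into the same slot */

    return tw.next_death == &tw && tw_list_has_cycle(slot, 16);
}
```

After the second insertion the node is its own successor, which is the 
condition the added BUG_TRAP(tw->next_death != tw) checks for, and a loop 
that pops the head until the slot is empty would never terminate.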

Comment 1 rlu 2001-07-25 00:19:56 UTC
Created attachment 24867 [details]
Fix to the bug

Comment 2 David Miller 2001-07-31 01:53:03 UTC
First, 2.4.0-7 kernel?  What is that? :-)

Second, every report I've seen like this has the person running
TUX, the lockup never occurs with people not using TUX (even if
TUX is compiled in and available).  This makes it look like a possible
TUX bug, and I'd therefore like Ingo to look at this first.


Comment 3 rlu 2001-07-31 05:19:10 UTC
By 2.4.0-7, I meant kernel versions 2.4.0 through 2.4.7. I am not running TUX at 
all; I was just running a multi-threaded, network I/O intensive application.

Comment 4 David Miller 2001-07-31 15:04:14 UTC
Can you possibly attach the source for this test program?  If I can reproduce it
here I can better evaluate your fix.


Comment 5 David Miller 2001-08-01 05:08:25 UTC
Created attachment 25711 [details]
Proposed fix for timewait races.

Comment 6 rlu 2001-08-30 18:27:20 UTC
The proposed fix by davem will not fix the problem. As I described in the first 
bug report, if there is a race between the hashdance and tcp_tw_schedule,
the tw will get recycled to the kmem_cache by the additional tcp_tw_put(tw), but 
the now invalid tw is still in the timewait list due to the tcp_tw_schedule.

Sorry, I cannot give you the application I am using to reproduce the problem. It 
is rather difficult to write another program that reproduces such a subtle race 
condition.

Rongqing


Comment 7 David Miller 2001-08-30 19:22:58 UTC
You put me in an interesting situation by saying that you
can produce an OOPS yet you cannot even provide me with the
test case to make this.

I am very certain that my patch does in fact fix the problem.
We grab an extra reference, so the access to the timewait
bucket during the tcp_tw_schedule() is ALWAYS VALID.  We are
guaranteed to still hold a refcount of ONE when that function
is called; only at the tcp_tw_put() added by my patch can the
timewait bucket be freed.

Did you actually test a kernel with my patch applied or did you
just look at the patch and say "that won't fix it" without even
testing it out?


Comment 8 rlu 2001-08-30 20:47:11 UTC
You are right. tcp_tw_deschedule() will not decrease the refcount if the tw is 
not in the timewait list yet, so the refcount will be TWO after tcp_tw_schedule(). 
This makes sure the tw can only be freed by either a later tcp_tw_deschedule() 
or the tcp_tw_put() you added.
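
The reference counting argument from Comments 7 and 8 can be modelled with 
a small single-threaded sketch. It is a simplification under stated 
assumptions, not the kernel implementation: struct tw_ref_sketch and the 
tw_put(), tw_schedule(), tw_deschedule() and tw_race_freed_before_schedule() 
helpers are hypothetical, the counter is a plain int rather than an atomic, 
and the two CPUs are replayed as one fixed interleaving.

```c
#include <assert.h>

/* Hypothetical model of a timewait bucket's lifetime state. */
struct tw_ref_sketch {
    int refcnt;     /* plain int here; assumed atomic in the kernel */
    int on_list;    /* 1 while linked on the death row */
    int freed;      /* set when the last reference is dropped */
};

static void tw_put(struct tw_ref_sketch *tw)
{
    assert(tw->refcnt > 0);
    if (--tw->refcnt == 0)
        tw->freed = 1;
}

static void tw_schedule(struct tw_ref_sketch *tw)
{
    if (!tw->on_list) {
        tw->on_list = 1;
        tw->refcnt++;       /* the list holds its own reference */
    }
}

static void tw_deschedule(struct tw_ref_sketch *tw)
{
    if (tw->on_list) {      /* no-op if the bucket is not scheduled yet */
        tw->on_list = 0;
        tw_put(tw);
    }
}

/*
 * Replay the race. Returns 1 if the bucket is already freed when CPU0
 * reaches its late schedule call, 0 if it is still valid.
 * extra_ref != 0 models the extra reference described in Comment 7.
 */
static int tw_race_freed_before_schedule(int extra_ref)
{
    struct tw_ref_sketch tw = { extra_ref ? 1 : 0, 0, 0 };
    int freed_early;

    tw.refcnt++;            /* CPU0: hashing takes a reference (assumed) */
    tw_deschedule(&tw);     /* CPU1: not on the list yet, so nothing drops */
    tw_put(&tw);            /* CPU1: kill drops the hash reference */

    freed_early = tw.freed;
    if (!freed_early) {
        tw_schedule(&tw);   /* CPU0: late schedule, refcnt is now two */
        tw_put(&tw);        /* CPU0: drop the extra reference */
    }
    return freed_early;
}
```

In this model, calling tw_race_freed_before_schedule(0) reports the bucket 
as freed before the late schedule, while tw_race_freed_before_schedule(1) 
keeps it alive, leaving the list's own reference to be dropped by a later 
deschedule.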