From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

Description of problem:
The Linux 2.4.0-7 SMP kernel always hangs under heavy load. On a 2-CPU system, one CPU is always at 100% sys when it happens. Printing the PC shows that the CPU is looping in tcp_twkill_thr.

How reproducible:
Always

Steps to Reproduce:
1. Put a BUG_TRAP in the function tcp_twkill in net/ipv4/tcp_minisocks.c:

    while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
        BUG_TRAP(tw->next_death != tw); /* added to show the problem */
        tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
        tw->pprev_death = NULL;
        tw->next_death = NULL; /* added here to prevent the system hang, just for showing the problem, it is not a fix */
        spin_unlock(&tw_death_lock);
        tcp_timewait_kill(tw);
        tcp_tw_put(tw);

2. Compile the kernel and run the system under heavy TCP load.

Actual Results:
/var/log/messages has the following lines after the system has been running under heavy network load.
Jul 17 23:01:03 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 01:59:26 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 04:14:41 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 04:30:57 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 04:49:40 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 06:26:58 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 06:32:26 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 06:37:10 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 07:55:18 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr
Jul 18 10:12:49 ce-linux-5 kernel: KERNEL: assertion (tw->next_death != tw) failed at tcp_minisocks.c(452):tcp_twkill__thr

Expected Results:
The BUG_TRAP should never fire.

Additional info:
The actual problem is a race between tcp_time_wait() and tcp_v4_check_established(). Under heavy load, after CPU0 has called __tcp_tw_hashdance(tw) but before it calls tcp_tw_schedule(tw, timeo), CPU1 can call tcp_v4_check_established(), which runs deschedule/timewait_kill/tcp_tw_put(tw), releasing the tw and putting it back into the kmem_cache. CPU0 then still calls tcp_tw_schedule(tw) although the tw is already invalid. The tw can get reused and inserted into the same timewait bucket, causing a cycle in the linked list.
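The cycle that the BUG_TRAP catches can be reproduced in a small userspace model. This is only a sketch under stated assumptions: struct tw_bucket, tw_schedule() and tw_walk() are hypothetical stand-ins that keep just the death-row links, not the kernel's actual code. It shows that head-inserting the same memory twice into one slot, as happens when a stale tw pointer is scheduled and the recycled object is then scheduled again, leaves tw->next_death pointing at tw itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a timewait bucket: only the death-row
 * links mentioned in the report are modeled. */
struct tw_bucket {
    struct tw_bucket *next_death;
    struct tw_bucket **pprev_death;
};

/* Head insertion into a death-row slot (an assumed simplification of
 * what scheduling a timewait bucket does to the list). */
void tw_schedule(struct tw_bucket **slot, struct tw_bucket *tw)
{
    tw->next_death = *slot;
    if (tw->next_death != NULL)
        tw->next_death->pprev_death = &tw->next_death;
    *slot = tw;
    tw->pprev_death = slot;
}

/* Walk a slot and count its entries. Returns -1 if the walk has not
 * reached NULL after `limit` entries, i.e. the list probably loops. */
int tw_walk(struct tw_bucket *head, int limit)
{
    int n = 0;
    struct tw_bucket *tw;

    for (tw = head; tw != NULL; tw = tw->next_death) {
        if (++n > limit)
            return -1;
    }
    return n;
}
```

Calling tw_schedule(&slot, &tw) twice with the same tw and the same slot makes tw.next_death == &tw, and tw_walk(slot, 16) then returns -1. A loop of the form while ((tw = slot) != NULL) that advances through next_death would never terminate on such a list, which matches the observed 100% sys CPU.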
Created attachment 24867 [details] Fix to the bug
First, 2.4.0-7 kernel? What is that? :-) Second, every report I've seen like this has the person running TUX; the lockup never occurs for people not using TUX (even if TUX is compiled in and available). This makes it look like it is possibly a TUX bug, and I'd therefore like Ingo to look at this first.
By 2.4.0-7, I meant kernel versions 2.4.0 through 2.4.7. I am not running TUX at all; I was just running a multi-threaded, network I/O intensive application.
Can you possibly attach the source for this test program? If I can reproduce it here I can better evaluate your fix.
Created attachment 25711 [details] Proposed fix for timewait races.
The proposed fix by davem will not fix the problem. As I described in the first bug report, if there is a race between the hashdance and tcp_tw_schedule, the tw will get recycled to the kmem_cache by the additional tcp_tw_put(tw), but the now invalid tw is still in the timewait list because of the tcp_tw_schedule. Sorry, I cannot get you the application I am using to reproduce the problem. It is rather difficult to write another program that reproduces such a subtle race condition. Rongqing
You put me in an interesting situation by saying that you can produce an OOPS yet you cannot even provide me with the test case that triggers it. I am very certain that my patch does in fact fix the problem. We grab an extra reference, so the access to the timewait bucket during the tcp_tw_schedule() is ALWAYS VALID. We are guaranteed to still hold a refcount of ONE when that function is called; only at the tcp_tw_put() added by my patch can the timewait bucket be freed. Did you actually test a kernel with my patch applied, or did you just look at the patch and say "that won't fix it" without even testing it out?
You are right. tcp_tw_deschedule() will not decrease the refcount if the tw is not in the timewait list yet, so the refcount will be TWO after tcp_tw_schedule(). That makes sure the tw can only be freed by either a later tcp_tw_deschedule() or the tcp_tw_put() you added.
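The reference counting argument in the last two comments can be checked with a small userspace model. This is a sketch under stated assumptions, not the kernel code or the attached patch: struct tw_model and all twm_* functions are hypothetical, hashing is assumed to hold one reference, scheduling is assumed to take one, and the competing CPU is assumed to drop only the hash reference plus the death-row reference if the tw is already on the list.

```c
#include <assert.h>

/* Hypothetical, simplified model of a timewait bucket's lifetime. */
struct tw_model {
    int refcnt;
    int hashed;        /* 1 while reachable through the hash table */
    int on_death_row;  /* 1 while linked on the death-row list */
    int freed;         /* 1 once the last reference has been dropped */
};

void twm_put(struct tw_model *tw)
{
    if (--tw->refcnt == 0)
        tw->freed = 1;
}

/* Create and hash: one reference for the hash table, plus an optional
 * extra reference held by the creating CPU. */
void twm_create_and_hash(struct tw_model *tw, int extra_ref)
{
    tw->refcnt = extra_ref ? 1 : 0;
    tw->on_death_row = 0;
    tw->freed = 0;
    tw->hashed = 1;
    tw->refcnt++;
}

/* Returns 0 on success, -1 if the bucket had already been freed. */
int twm_schedule(struct tw_model *tw)
{
    if (tw->freed)
        return -1;
    tw->on_death_row = 1;
    tw->refcnt++;
    return 0;
}

/* Drops the death-row reference only if the bucket is on the list. */
void twm_deschedule(struct tw_model *tw)
{
    if (tw->on_death_row) {
        tw->on_death_row = 0;
        twm_put(tw);
    }
}

/* Assumed behavior of the competing CPU: deschedule, then unhash. */
void twm_recycle(struct tw_model *tw)
{
    twm_deschedule(tw);
    if (tw->hashed) {
        tw->hashed = 0;
        twm_put(tw);
    }
}

/* Replays the race with CPU1 running between hashing and scheduling.
 * Returns the refcount right after scheduling, or -1 if scheduling
 * touched a bucket that had already been freed. */
int twm_replay_race(int extra_ref)
{
    struct tw_model tw;
    int after_schedule;

    twm_create_and_hash(&tw, extra_ref);
    twm_recycle(&tw);               /* CPU1 wins the race */
    if (twm_schedule(&tw) != 0)     /* CPU0 resumes */
        return -1;
    after_schedule = tw.refcnt;
    if (extra_ref)
        twm_put(&tw);               /* release the extra reference */
    return after_schedule;
}
```

In this model twm_replay_race(0) returns -1, since the bucket is freed before it is scheduled, while twm_replay_race(1) returns 2, which agrees with the count of TWO after tcp_tw_schedule() stated above.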