Bug 161168

Summary: System freezes when I seed multiple torrents at once.
Product: [Fedora] Fedora
Component: kernel
Version: 4
Hardware: i386
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Nathan G. Grennan <redhat-bugzilla>
Assignee: Dave Jones <davej>
QA Contact: Brian Brock <bbrock>
CC: davej, pfrields, wtogami
Doc Type: Bug Fix
Last Closed: 2005-06-29 04:49:55 UTC
Attachments: policy routing init script (flags: none)

Description Nathan G. Grennan 2005-06-20 22:44:57 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050513 Galeon/1.3.21

Description of problem:
I have a server at a colo. When I start seeding two torrents at once, the system freezes within 5 minutes. I have tried replacing hardware, and it didn't help. Most of the time it just freezes, but I have twice seen an oops. I tried using a serial console, but was unsuccessful in collecting the oops information.
I also tried netconsole, but the remote computer didn't see any information from the computer having the problem.

I seem to have narrowed it down to the kernel. If I downgrade to an FC3 kernel like 2.6.11-1.27 or 2.6.11-1.35, it works instead of freezing within seconds or minutes. I also tried the latest development kernel, 2.6.11-1.1383, and it still froze. In addition, I tried 2.6.11-1.1226 from FC4T2 and 2.6.11-1.1286 from FC4T3; both froze.

I suspect it relates to a change from 2.6.11.X to 2.6.12rcX. I am curious how a kernel based on the final release of 2.6.12 will react.

Version-Release number of selected component (if applicable):
kernel-2.6.11-1.1369_FC4

How reproducible:
Always

Steps to Reproduce:
1. btdownloadcurses.py --url 'url'
2. btdownloadcurses.py --url 'url'

  

Actual Results:  Within a few minutes or seconds it will freeze or oops

Expected Results:  Doesn't freeze or oops

Additional info:

I am willing to try new ways to collect oops information, or other kernels.

Comment 1 Nathan G. Grennan 2005-06-20 22:53:33 UTC
The setup is special in a few ways. One is that it uses a script to do same
in/same out, so incoming connections go back out the correct interface and
multiple networks are acceptable. In addition, outbound traffic is set up for
round robin: the first connection goes out eth0, and the next goes out eth1.
Being at a colo, instead of a normal 3/256 type connection it has two 10/10
connections.

Another way it is special is that it has four network cards. Two are on one
network, and the others are on their own networks. One is onboard and three are
PCI cards. I tried different brands of network card, and hence different
drivers, with the same effect. Drivers I tried were e100, eepro100, tulip, and
8139too.

One of the interfaces is also in promiscuous mode for use with ntop, a network
traffic monitor.

I cap the torrents at 300k/s, but suspect it may be more disk access based. The
seed files are hundreds of megs, and sometimes the system locks up within a few
seconds of starting the second torrent.

Comment 2 Nathan G. Grennan 2005-06-21 15:07:25 UTC
I tried recompiling 2.6.11-1.1369 with gcc 3.2.3 instead of gcc 4.0.0 after
seeing plenty of discussion on whether it was wise to compile the kernel with
4.0.0. Many suggested waiting for 4.0.1. After many hours (boy, does compiling
UP, SMP, XenU, and Xen0 take a while) it compiled, and this morning I gave it a
try. It still froze on me.

Comment 3 Nathan G. Grennan 2005-06-21 15:51:00 UTC
I have tried kernel-2.6.12-1.1387_FC5 on the system and it seems to fix the
freeze problem. It would be nice to see a similar kernel officially released for
FC4.

Comment 4 Nathan G. Grennan 2005-06-21 16:32:54 UTC
Well, I thought it was fixed. It took a lot longer this time, but it did lock up
again with 2.6.12-1.1387.

Comment 5 Nathan G. Grennan 2005-06-23 03:51:04 UTC
Here is an oops that was triggered in the same way. Note that this came from a
build of 2.6.11-1.1369 that was recompiled after being stripped down to the
fewest patches I could manage. The idea was to see whether the problem was
Fedora-only or not. Reducing the number of patches seems to have made an oops
more likely than just a freeze.

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
00000000
*pde = 05646067
Oops: 0000 [#1]
Modules linked in: md5 ipv6 dm_mod video button battery ac loop uhci_hcd
ehci_hcd shpchp hw_random i2c_i801 i2c_core snd_intel8x0 snd_ac97_codec
snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss
snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc e100 mii floppy
ext3 jbd
CPU:    0
EIP:    0060:[<00000000>]    Not tainted VLI
EFLAGS: 00010246   (2.6.11-1.1369_FC4.root)
EIP is at _stext+0x3feffdd8/0x8
eax: cc697a00   ebx: 00000000   ecx: 00000000   edx: cc697a00
esi: 00000000   edi: c0471a20   ebp: bff4dfe8   esp: c044bfd4
ds: 007b   es: 007b   ss: 0068
Process genpkgmetadata. (pid: 18703, threadinfo=c044b000 task=cd223aa0)
Stack: c0138fe2 00000000 c0470ec8 0000000a c0127f19 00000001 c0127cee c0f5ef98
       00000046 00000000 c01055c9
Call Trace:
 [<c0138fe2>] rcu_do_batch+0x1a/0x57
 [<c0127f19>] tasklet_action+0x32/0x5d
 [<c0127cee>] __do_softirq+0x3e/0x8a
 [<c01055c9>] do_softirq+0x3e/0x42
 =======================
 [<c01054c4>] do_IRQ+0x51/0x82
 [<c01039be>] common_interrupt+0x1a/0x20
Code:  Bad EIP value.
 <0>Kernel panic - not syncing: Fatal exception in interrupt



Here is part of a previous Oops with the same kernel.

EFLAGS: 00010246   (2.6.11-1.1369_FC4.root)
EIP is at _stext+0x3feffdd8/0x8
eax: d6dd7900   ebx: 00000000   ecx: 00000000   edx: d6dd7900
esi: 00000000   edi: c0471a20   ebp: 00dbd035   esp: c044bfd4
ds: 007b   es: 007b   ss: 0068
Process ntop (pid: 2157, threadinfo=c044b000 task=c17dbaa0)
Stack: c0138fe2 00000000 c0470ec8 0000000a c0127f19 00000001 c0127cee dcaa1f24
       00000046 00000000 c01055c9
Call Trace:
 [<c0138fe2>] rcu_do_batch+0x1a/0x57
 [<c0127f19>] tasklet_action+0x32/0x5d
 [<c0127cee>] __do_softirq+0x3e/0x8a
 [<c01055c9>] do_softirq+0x3e/0x42
 =======================
 [<c01054c4>] do_IRQ+0x51/0x82
 [<c01039be>] common_interrupt+0x1a/0x20
 [<c014007b>] posix_cpu_timer_set+0x6c1/0x8c5
 [<c0111fc0>] get_offset_pmtmr+0x17/0x1057
 [<c0107e66>] do_gettimeofday+0x1a/0xc4
 [<c01271bf>] sys_time+0xf/0x30
 [<c01037a7>] sysenter_past_esp+0x54/0x75
Code:  Bad EIP value.
 <0>Kernel panic - not syncing: Fatal exception in interrupt
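
The two traces above can be compared mechanically. The following is a minimal sketch, not part of the original report, that extracts the symbol names from oops call-trace lines and lists the frames the two traces share, starting from the innermost one. The helper names `parse_trace` and `common_prefix` are hypothetical, and the regular expression only assumes the ` [<address>] symbol+0xoff/0xsize` layout seen in the logs in this report.

```python
import re

# Matches trace entries such as " [<c0138fe2>] rcu_do_batch+0x1a/0x57".
TRACE_RE = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_trace(text):
    """Return (address, symbol, offset, size) tuples for each trace entry."""
    frames = []
    for match in TRACE_RE.finditer(text):
        addr, symbol, offset, size = match.groups()
        frames.append((int(addr, 16), symbol, int(offset, 16), int(size, 16)))
    return frames


def common_prefix(trace_a, trace_b):
    """Symbols shared by both traces, counted from the innermost frame."""
    shared = []
    for frame_a, frame_b in zip(trace_a, trace_b):
        if frame_a[1] != frame_b[1]:
            break
        shared.append(frame_a[1])
    return shared


# Trace lines copied from the two oopses in this comment.
first = parse_trace("""
 [<c0138fe2>] rcu_do_batch+0x1a/0x57
 [<c0127f19>] tasklet_action+0x32/0x5d
 [<c0127cee>] __do_softirq+0x3e/0x8a
 [<c01055c9>] do_softirq+0x3e/0x42
 [<c01054c4>] do_IRQ+0x51/0x82
 [<c01039be>] common_interrupt+0x1a/0x20
""")
second = parse_trace("""
 [<c0138fe2>] rcu_do_batch+0x1a/0x57
 [<c0127f19>] tasklet_action+0x32/0x5d
 [<c0127cee>] __do_softirq+0x3e/0x8a
 [<c01055c9>] do_softirq+0x3e/0x42
 [<c01054c4>] do_IRQ+0x51/0x82
 [<c01039be>] common_interrupt+0x1a/0x20
 [<c014007b>] posix_cpu_timer_set+0x6c1/0x8c5
 [<c0111fc0>] get_offset_pmtmr+0x17/0x1057
""")

print(common_prefix(first, second))
```

Both oopses share the same softirq path through rcu_do_batch, while the interrupted process differs (genpkgmetadata versus ntop), which is consistent with the crash not depending on which process happened to be running.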


Comment 6 Nathan G. Grennan 2005-06-23 15:36:03 UTC
Here is Oops output from 2.6.12-1.1392_FC5:

Unable to handle kernel paging request at virtual address 6b6b6c17
 printing eip:
c0326820
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: md5 ipv6 dm_mod video button battery ac loop uhci_hcd
ehci_hcd shpchp hw_rand
CPU:    0
EIP:    0060:[<c0326820>]    Not tainted VLI
EFLAGS: 00010202   (2.6.12-1.1392_FC5)
EIP is at __ip_route_output_key+0x58/0x106
eax: bf432a40   ebx: 6b6b6b6b   ecx: ec070bff   edx: d6033c3c
esi: d37c5e70   edi: d37c5eb8   ebp: de2d80a8   esp: d37c5e34
ds: 007b   es: 007b   ss: 0068
Process fing (pid: 9641, threadinfo=d37c5000 task=d37a8550)
Stack: bf432a40 d37c5eb8 d37c5ee8 de2d80a8 c03485fd 00000000 00000001 c03cdaa0
       003cdaa0 c037a0c0 bf432a40 01048065 11320b80 00000000 00000000 00000000
       00000000 bf432a40 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace:
 [<c03485fd>] ip4_datagram_connect+0x10d/0x320
 [<c0351886>] inet_dgram_connect+0x2e/0x5d
 [<c02fd477>] sys_connect+0x95/0x9e
 [<c02fc0a2>] sock_map_file+0x90/0x126
 [<c017bca5>] get_unused_fd+0x79/0x1d2
 [<c02fcf52>] __sock_create+0x128/0x1dd
 [<c02fded1>] sys_socketcall+0xb9/0x292
 [<c0103a51>] syscall_call+0x7/0xb
Code: 8b 15 14 2b 4a c0 8b 14 10 85 d2 0f 84 98 00 00 00 89 d3 eb 13 a1 20 2a 4a
c0 83 40 3c 01
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 [<c01208e8>] panic+0x45/0x1e2
 [<c0104614>] die+0x222/0x2c4
 [<c0118dd5>] do_page_fault+0x1d9/0x59f
 [<c018fd25>] __link_path_walk+0xbcc/0x12db
 [<c015a270>] __do_page_cache_readahead+0x9e/0x118
 [<c0157810>] buffered_rmqueue+0x225/0x31b
 [<c015a6bc>] dbg_redzone1+0xe/0x1f
 [<c015cd49>] cache_alloc_debugcheck_after+0x31/0x11d
 [<c01281b4>] current_fs_time+0x4e/0x69
 [<c0118bfc>] do_page_fault+0x0/0x59f
 [<c0103c6b>] error_code+0x4f/0x54
 [<c0326820>] __ip_route_output_key+0x58/0x106
 [<c03485fd>] ip4_datagram_connect+0x10d/0x320
 [<c0351886>] inet_dgram_connect+0x2e/0x5d
 [<c02fd477>] sys_connect+0x95/0x9e
 [<c02fc0a2>] sock_map_file+0x90/0x126
 [<c017bca5>] get_unused_fd+0x79/0x1d2
 [<c02fcf52>] __sock_create+0x128/0x1dd
 [<c02fded1>] sys_socketcall+0xb9/0x292
 [<c0103a51>] syscall_call+0x7/0xb
 <3>BUG: soft lockup detected on CPU#0!

Pid: 9641, comm:                 fing
EIP: 0060:[<c011295b>] CPU: 0
EIP is at delay_pmtmr+0xb/0x13
 EFLAGS: 00000287    Not tainted  (2.6.12-1.1392_FC5)
EAX: 3758a456 EBX: 001e3010 ECX: 3756076e EDX: 00000291
ESI: 00000000 EDI: c03882dd EBP: 000001ad DS: 007b ES: 007b
CR0: 8005003b CR2: 6b6b6c17 CR3: 12ce0000 CR4: 000006d0
 [<c02146bd>] __delay+0x9/0xa
 [<c0120a0c>] panic+0x169/0x1e2
 [<c0104614>] die+0x222/0x2c4
 [<c0118dd5>] do_page_fault+0x1d9/0x59f
 [<c018fd25>] __link_path_walk+0xbcc/0x12db
 [<c015a270>] __do_page_cache_readahead+0x9e/0x118
 [<c0157810>] buffered_rmqueue+0x225/0x31b
 [<c015a6bc>] dbg_redzone1+0xe/0x1f
 [<c015cd49>] cache_alloc_debugcheck_after+0x31/0x11d
 [<c01281b4>] current_fs_time+0x4e/0x69
 [<c0118bfc>] do_page_fault+0x0/0x59f
 [<c0103c6b>] error_code+0x4f/0x54
 [<c0326820>] __ip_route_output_key+0x58/0x106
 [<c03485fd>] ip4_datagram_connect+0x10d/0x320
 [<c0351886>] inet_dgram_connect+0x2e/0x5d
 [<c02fd477>] sys_connect+0x95/0x9e
 [<c02fc0a2>] sock_map_file+0x90/0x126
 [<c017bca5>] get_unused_fd+0x79/0x1d2
 [<c02fcf52>] __sock_create+0x128/0x1dd
 [<c02fded1>] sys_socketcall+0xb9/0x292
 [<c0103a51>] syscall_call+0x7/0xb
 [<c0150721>] softlockup_tick+0x95/0x1b8
 [<c012cd0d>] update_wall_time+0x14/0x40
 [<c012d301>] do_timer+0x4d/0xfb
 [<c0108c2e>] timer_interrupt+0x60/0x1b5
 [<c015099d>] handle_IRQ_event+0x2e/0x5a
 [<c0150a7c>] __do_IRQ+0xb3/0x347
 [<c0105b1d>] do_IRQ+0x4a/0x82
 =======================
 [<c0103c0e>] common_interrupt+0x1a/0x20
 [<c012007b>] copy_process+0x1150/0x1255
 [<c011295b>] delay_pmtmr+0xb/0x13
 [<c02146bd>] __delay+0x9/0xa
 [<c0120a0c>] panic+0x169/0x1e2
 [<c0104614>] die+0x222/0x2c4
 [<c0118dd5>] do_page_fault+0x1d9/0x59f
 [<c018fd25>] __link_path_walk+0xbcc/0x12db
 [<c015a270>] __do_page_cache_readahead+0x9e/0x118
 [<c0157810>] buffered_rmqueue+0x225/0x31b
 [<c015a6bc>] dbg_redzone1+0xe/0x1f
 [<c015cd49>] cache_alloc_debugcheck_after+0x31/0x11d
 [<c01281b4>] current_fs_time+0x4e/0x69
 [<c0118bfc>] do_page_fault+0x0/0x59f
 [<c0103c6b>] error_code+0x4f/0x54
 [<c0326820>] __ip_route_output_key+0x58/0x106
 [<c03485fd>] ip4_datagram_connect+0x10d/0x320
 [<c0351886>] inet_dgram_connect+0x2e/0x5d
 [<c02fd477>] sys_connect+0x95/0x9e
 [<c02fc0a2>] sock_map_file+0x90/0x126
 [<c017bca5>] get_unused_fd+0x79/0x1d2
 [<c02fcf52>] __sock_create+0x128/0x1dd
 [<c02fded1>] sys_socketcall+0xb9/0x292
 [<c0103a51>] syscall_call+0x7/0xb
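
One detail in this oops may be worth checking: ebx holds 6b6b6b6b and the faulting address is 6b6b6c17. In Linux slab debugging, 0x6b is the byte written over freed objects (POISON_FREE), and the dbg_redzone1 and cache_alloc_debugcheck_after frames in the trace suggest slab debugging is compiled into this kernel. The following is a small sketch, not part of the original report, that computes how far the faulting address is from an all-0x6b pointer. The function name `offset_from_poison` is hypothetical, and the reading of the result is an assumption rather than a confirmed diagnosis.

```python
# Byte the kernel slab debugging code writes over freed objects.
POISON_FREE = 0x6B


def offset_from_poison(fault_addr, width=4):
    """Distance of fault_addr from a pointer made entirely of poison bytes.

    width is the pointer size in bytes (4 on the i386 system in this report).
    """
    poisoned_pointer = int.from_bytes(bytes([POISON_FREE]) * width, "big")
    return fault_addr - poisoned_pointer


ebx = 0x6B6B6B6B         # register value from the oops
fault_addr = 0x6B6B6C17  # "virtual address" from the oops

print(hex(offset_from_poison(fault_addr)))  # 0xac
```

An offset of 0xac from a poisoned pointer would be consistent with the routing code reading a field of a structure through a pointer that was loaded from already freed memory, although the log alone does not prove that.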

Comment 7 Nathan G. Grennan 2005-06-23 15:40:09 UTC
fing is a copy of ping renamed.

Comment 8 Nathan G. Grennan 2005-06-23 16:57:12 UTC
The URL below talks about what sounds like the same problem. It sounds like it
is fixed in 2.6.12rc5-mm1, but not in mainline/vanilla.

http://www.uwsg.iu.edu/hypermail/linux/kernel/0505.3/0631.html

Comment 9 Nathan G. Grennan 2005-06-24 01:34:59 UTC
The URL above does not seem to be related after all.

I have found that disabling the script I will attach to this bug report works
around the problem. The script sets up multiple default gateways and uses them
round robin style. It also sets up same in/same out. From my review of the oops
above so far it seems to relate to the multiple default gateways. I plan to do
further analysis.
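
For readers unfamiliar with this kind of setup, the following is a rough sketch of what same in/same out plus round-robin default gateways typically looks like with iproute2. It is not the attached script. The function name `policy_routing_commands`, the interface names, the addresses (taken from documentation ranges), and the table numbers are all assumptions made for illustration. The code only builds the command strings and does not run them.

```python
def policy_routing_commands(uplinks):
    """Build iproute2 commands for source-based routing plus a multipath default.

    uplinks is a list of dicts with the keys dev, addr, gateway, and table.
    """
    commands = []
    for up in uplinks:
        # Per-uplink routing table: traffic sourced from this address leaves
        # through the matching gateway (same in/same out).
        commands.append(
            f"ip route add default via {up['gateway']} dev {up['dev']} table {up['table']}"
        )
        commands.append(f"ip rule add from {up['addr']} table {up['table']}")
    # A single default route with several equally weighted next hops, so new
    # outbound connections are spread across the uplinks.
    hops = " ".join(
        f"nexthop via {up['gateway']} dev {up['dev']} weight 1" for up in uplinks
    )
    commands.append(f"ip route add default scope global {hops}")
    return commands


# Hypothetical uplinks; none of these values come from the bug report.
example_uplinks = [
    {"dev": "eth0", "addr": "192.0.2.10", "gateway": "192.0.2.1", "table": 101},
    {"dev": "eth1", "addr": "198.51.100.10", "gateway": "198.51.100.1", "table": 102},
]

cmds = policy_routing_commands(example_uplinks)
for cmd in cmds:
    print(cmd)
```

The last command, the default route with more than one nexthop, is the part that exercises the kernel's multipath routing code.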

Comment 10 Nathan G. Grennan 2005-06-24 01:37:05 UTC
Created attachment 115916 [details]
policy routing init script

Comment 11 Nathan G. Grennan 2005-06-24 15:21:07 UTC
The solution seems to be to disable CONFIG_IP_ROUTE_MULTIPATH_CACHED. I
recompiled 2.6.12-1.1392 with it disabled and it has been going for over an hour
without an oops. It is the major change in the code that handles multiple
default routes between 2.6.11 and 2.6.12.
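
To check whether a given kernel build has the option enabled, the kernel config file can be inspected. The following is a minimal sketch, assuming the usual .config line formats (`OPTION=y`, `OPTION=m`, or `# OPTION is not set`). The function name `kconfig_state` and the sample config text are hypothetical. On Fedora the config for the running kernel is normally installed under /boot as config-<kernel version>, but the sketch works on any config text passed to it.

```python
def kconfig_state(config_text, option):
    """Return 'y', 'm', or 'n' for an option in kernel .config text."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return "n"
        # The "=" keeps CONFIG_IP_ROUTE_MULTIPATH from matching the longer
        # CONFIG_IP_ROUTE_MULTIPATH_CACHED line.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    # Options that do not appear at all are treated as disabled.
    return "n"


# Hypothetical config fragments for illustration only.
affected = "CONFIG_IP_ROUTE_MULTIPATH=y\nCONFIG_IP_ROUTE_MULTIPATH_CACHED=y\n"
rebuilt = "CONFIG_IP_ROUTE_MULTIPATH=y\n# CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set\n"

print(kconfig_state(affected, "CONFIG_IP_ROUTE_MULTIPATH_CACHED"))
print(kconfig_state(rebuilt, "CONFIG_IP_ROUTE_MULTIPATH_CACHED"))
```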


Comment 12 David Miller 2005-06-25 04:13:55 UTC
IP_ROUTE_MULTIPATH_CACHED is known to be buggy, in fact very buggy,
that's why it's marked "EXPERIMENTAL" with very big capital letters.

It should not be enabled in any kernel shipped with a distribution.

Reassigning to davej, David please turn that off in our configs.
If something is keyed on CONFIG_EXPERIMENTAL it really shouldn't
be enabled by default.


Comment 13 Dave Jones 2005-06-27 17:44:58 UTC
Removed for next build. Thanks.
There isn't actually a dependency on CONFIG_EXPERIMENTAL for that option; it's
just marked as (EXPERIMENTAL) in the text.