From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050513 Galeon/1.3.21

Description of problem:
I have a server at a colo. When I start seeding two torrents at once, the system freezes within 5 minutes. I have tried replacing hardware, and it didn't help. Most of the time it just freezes, but I have twice seen an oops. I tried using a serial console, but was unsuccessful in collecting the oops information. I also tried netconsole, but the remote computer didn't see any information from the computer having the problem.

I seem to have narrowed it down to the kernel. If I downgrade to an FC3 kernel like 2.6.11-1.27 or 2.6.11-1.35, it keeps working instead of freezing within seconds or minutes. I also tried going to the latest development kernel, 2.6.11-1.1383. It still froze. In addition I tried 2.6.11-1.1226 from FC4T2 and 2.6.11-1.1286 from FC4T3; both froze. I suspect it relates to a change from 2.6.11.X to 2.6.12rcX. I am curious how a kernel based on the final release of 2.6.12 will react.

Version-Release number of selected component (if applicable):
kernel-2.6.11-1.1369_FC4

How reproducible:
Always

Steps to Reproduce:
1. btdownloadcurses.py --url 'url'
2. btdownloadcurses.py --url 'url'

Actual Results: Within a few minutes or seconds it will freeze or oops

Expected Results: Doesn't freeze or oops

Additional info:
I am willing to try new ways to collect oops information, or other kernels.
The setup is special in a few ways. One is that it uses a script to do same in/same out, so incoming connections go back out the correct interface and multiple networks are acceptable. In addition, outbound traffic is set up for round robin, so the first connection goes out eth0 and the next goes out eth1. Being at a colo, instead of a normal 3/256 type connection it has two 10/10 connections. Another way it is special is that it has four network cards: two on one network, and the others on their own networks. One is onboard and three are PCI cards. I tried different brands of network card, and hence different drivers, with the same effect. Drivers I tried were e100, eepro100, tulip, and 8139too. One of the interfaces is also in promiscuous mode for use with ntop, a network traffic monitor. I cap the torrents at 300k/s, but suspect the problem may be more disk access based. The seed files are hundreds of megs, and sometimes the system locks up within a few seconds of starting the second torrent.
I tried recompiling 2.6.11-1.1369 with gcc 3.2.3 instead of 4.0.0 after seeing plenty of discussion about whether it was wise to compile the kernel with 4.0.0. Many suggested waiting for 4.0.1. After many hours (boy does compiling UP, SMP, XenU, and Xen0 take a while) it compiled, and this morning I gave it a try. It still froze on me.
I have tried kernel-2.6.12-1.1387_FC5 on the system and it seems to fix the freeze problem. It would be nice to see a similar kernel officially released for FC4.
Well, I thought it was fixed. It took a lot longer this time, but it did lock up again with 2.6.12-1.1387.
Here is an Oops that was triggered in the same way. Note, this came from a build of 2.6.11-1.1369 that was recompiled after being stripped down to the fewest patches I could manage. The idea was to see whether the problem was Fedora-only or not. Reducing the number of patches seems to have made an Oops more likely than a plain freeze.

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
00000000
*pde = 05646067
Oops: 0000 [#1]
Modules linked in: md5 ipv6 dm_mod video button battery ac loop uhci_hcd ehci_hcd shpchp hw_random i2c_i801 i2c_core snd_intel8x0 snd_ac97_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc e100 mii floppy ext3 jbd
CPU: 0
EIP: 0060:[<00000000>] Not tainted VLI
EFLAGS: 00010246 (2.6.11-1.1369_FC4.root)
EIP is at _stext+0x3feffdd8/0x8
eax: cc697a00 ebx: 00000000 ecx: 00000000 edx: cc697a00
esi: 00000000 edi: c0471a20 ebp: bff4dfe8 esp: c044bfd4
ds: 007b es: 007b ss: 0068
Process genpkgmetadata. (pid: 18703, threadinfo=c044b000 task=cd223aa0)
Stack: c0138fe2 00000000 c0470ec8 0000000a c0127f19 00000001 c0127cee c0f5ef98
       00000046 00000000 c01055c9
Call Trace:
 [<c0138fe2>] rcu_do_batch+0x1a/0x57
 [<c0127f19>] tasklet_action+0x32/0x5d
 [<c0127cee>] __do_softirq+0x3e/0x8a
 [<c01055c9>] do_softirq+0x3e/0x42
 =======================
 [<c01054c4>] do_IRQ+0x51/0x82
 [<c01039be>] common_interrupt+0x1a/0x20
Code: Bad EIP value.
<0>Kernel panic - not syncing: Fatal exception in interrupt

Here is part of a previous Oops with the same kernel.

EFLAGS: 00010246 (2.6.11-1.1369_FC4.root)
EIP is at _stext+0x3feffdd8/0x8
eax: d6dd7900 ebx: 00000000 ecx: 00000000 edx: d6dd7900
esi: 00000000 edi: c0471a20 ebp: 00dbd035 esp: c044bfd4
ds: 007b es: 007b ss: 0068
Process ntop (pid: 2157, threadinfo=c044b000 task=c17dbaa0)
Stack: c0138fe2 00000000 c0470ec8 0000000a c0127f19 00000001 c0127cee dcaa1f24
       00000046 00000000 c01055c9
Call Trace:
 [<c0138fe2>] rcu_do_batch+0x1a/0x57
 [<c0127f19>] tasklet_action+0x32/0x5d
 [<c0127cee>] __do_softirq+0x3e/0x8a
 [<c01055c9>] do_softirq+0x3e/0x42
 =======================
 [<c01054c4>] do_IRQ+0x51/0x82
 [<c01039be>] common_interrupt+0x1a/0x20
 [<c014007b>] posix_cpu_timer_set+0x6c1/0x8c5
 [<c0111fc0>] get_offset_pmtmr+0x17/0x1057
 [<c0107e66>] do_gettimeofday+0x1a/0xc4
 [<c01271bf>] sys_time+0xf/0x30
 [<c01037a7>] sysenter_past_esp+0x54/0x75
Code: Bad EIP value.
<0>Kernel panic - not syncing: Fatal exception in interrupt
Here is Oops output from 2.6.12-1.1392_FC5:

Unable to handle kernel paging request at virtual address 6b6b6c17
 printing eip:
c0326820
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: md5 ipv6 dm_mod video button battery ac loop uhci_hcd ehci_hcd shpchp hw_randCPU: 0
EIP: 0060:[<c0326820>] Not tainted VLI
EFLAGS: 00010202 (2.6.12-1.1392_FC5)
EIP is at __ip_route_output_key+0x58/0x106
eax: bf432a40 ebx: 6b6b6b6b ecx: ec070bff edx: d6033c3c
esi: d37c5e70 edi: d37c5eb8 ebp: de2d80a8 esp: d37c5e34
ds: 007b es: 007b ss: 0068
Process fing (pid: 9641, threadinfo=d37c5000 task=d37a8550)
Stack: bf432a40 d37c5eb8 d37c5ee8 de2d80a8 c03485fd 00000000 00000001 c03cdaa0
       003cdaa0 c037a0c0 bf432a40 01048065 11320b80 00000000 00000000 00000000
       00000000 bf432a40 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace:
 [<c03485fd>] ip4_datagram_connect+0x10d/0x320
 [<c0351886>] inet_dgram_connect+0x2e/0x5d
 [<c02fd477>] sys_connect+0x95/0x9e
 [<c02fc0a2>] sock_map_file+0x90/0x126
 [<c017bca5>] get_unused_fd+0x79/0x1d2
 [<c02fcf52>] __sock_create+0x128/0x1dd
 [<c02fded1>] sys_socketcall+0xb9/0x292
 [<c0103a51>] syscall_call+0x7/0xb
Code: 8b 15 14 2b 4a c0 8b 14 10 85 d2 0f 84 98 00 00 00 89 d3 eb 13 a1 20 2a 4a c0 83 40 3c 01
<0>Kernel panic - not syncing: Fatal exception in interrupt
 [<c01208e8>] panic+0x45/0x1e2
 [<c0104614>] die+0x222/0x2c4
 [<c0118dd5>] do_page_fault+0x1d9/0x59f
 [<c018fd25>] __link_path_walk+0xbcc/0x12db
 [<c015a270>] __do_page_cache_readahead+0x9e/0x118
 [<c0157810>] buffered_rmqueue+0x225/0x31b
 [<c015a6bc>] dbg_redzone1+0xe/0x1f
 [<c015cd49>] cache_alloc_debugcheck_after+0x31/0x11d
 [<c01281b4>] current_fs_time+0x4e/0x69
 [<c0118bfc>] do_page_fault+0x0/0x59f
 [<c0103c6b>] error_code+0x4f/0x54
 [<c0326820>] __ip_route_output_key+0x58/0x106
 [<c03485fd>] ip4_datagram_connect+0x10d/0x320
 [<c0351886>] inet_dgram_connect+0x2e/0x5d
 [<c02fd477>] sys_connect+0x95/0x9e
 [<c02fc0a2>] sock_map_file+0x90/0x126
 [<c017bca5>] get_unused_fd+0x79/0x1d2
 [<c02fcf52>] __sock_create+0x128/0x1dd
 [<c02fded1>] sys_socketcall+0xb9/0x292
 [<c0103a51>] syscall_call+0x7/0xb
<3>BUG: soft lockup detected on CPU#0!

Pid: 9641, comm: fing
EIP: 0060:[<c011295b>] CPU: 0
EIP is at delay_pmtmr+0xb/0x13
 EFLAGS: 00000287 Not tainted (2.6.12-1.1392_FC5)
EAX: 3758a456 EBX: 001e3010 ECX: 3756076e EDX: 00000291
ESI: 00000000 EDI: c03882dd EBP: 000001ad DS: 007b ES: 007b
CR0: 8005003b CR2: 6b6b6c17 CR3: 12ce0000 CR4: 000006d0
 [<c02146bd>] __delay+0x9/0xa
 [<c0120a0c>] panic+0x169/0x1e2
 [<c0104614>] die+0x222/0x2c4
 [<c0118dd5>] do_page_fault+0x1d9/0x59f
 [<c018fd25>] __link_path_walk+0xbcc/0x12db
 [<c015a270>] __do_page_cache_readahead+0x9e/0x118
 [<c0157810>] buffered_rmqueue+0x225/0x31b
 [<c015a6bc>] dbg_redzone1+0xe/0x1f
 [<c015cd49>] cache_alloc_debugcheck_after+0x31/0x11d
 [<c01281b4>] current_fs_time+0x4e/0x69
 [<c0118bfc>] do_page_fault+0x0/0x59f
 [<c0103c6b>] error_code+0x4f/0x54
 [<c0326820>] __ip_route_output_key+0x58/0x106
 [<c03485fd>] ip4_datagram_connect+0x10d/0x320
 [<c0351886>] inet_dgram_connect+0x2e/0x5d
 [<c02fd477>] sys_connect+0x95/0x9e
 [<c02fc0a2>] sock_map_file+0x90/0x126
 [<c017bca5>] get_unused_fd+0x79/0x1d2
 [<c02fcf52>] __sock_create+0x128/0x1dd
 [<c02fded1>] sys_socketcall+0xb9/0x292
 [<c0103a51>] syscall_call+0x7/0xb
 [<c0150721>] softlockup_tick+0x95/0x1b8
 [<c012cd0d>] update_wall_time+0x14/0x40
 [<c012d301>] do_timer+0x4d/0xfb
 [<c0108c2e>] timer_interrupt+0x60/0x1b5
 [<c015099d>] handle_IRQ_event+0x2e/0x5a
 [<c0150a7c>] __do_IRQ+0xb3/0x347
 [<c0105b1d>] do_IRQ+0x4a/0x82
 =======================
 [<c0103c0e>] common_interrupt+0x1a/0x20
 [<c012007b>] copy_process+0x1150/0x1255
 [<c011295b>] delay_pmtmr+0xb/0x13
 [<c02146bd>] __delay+0x9/0xa
 [<c0120a0c>] panic+0x169/0x1e2
 [<c0104614>] die+0x222/0x2c4
 [<c0118dd5>] do_page_fault+0x1d9/0x59f
 [<c018fd25>] __link_path_walk+0xbcc/0x12db
 [<c015a270>] __do_page_cache_readahead+0x9e/0x118
 [<c0157810>] buffered_rmqueue+0x225/0x31b
 [<c015a6bc>] dbg_redzone1+0xe/0x1f
 [<c015cd49>] cache_alloc_debugcheck_after+0x31/0x11d
 [<c01281b4>] current_fs_time+0x4e/0x69
 [<c0118bfc>] do_page_fault+0x0/0x59f
 [<c0103c6b>] error_code+0x4f/0x54
 [<c0326820>] __ip_route_output_key+0x58/0x106
 [<c03485fd>] ip4_datagram_connect+0x10d/0x320
 [<c0351886>] inet_dgram_connect+0x2e/0x5d
 [<c02fd477>] sys_connect+0x95/0x9e
 [<c02fc0a2>] sock_map_file+0x90/0x126
 [<c017bca5>] get_unused_fd+0x79/0x1d2
 [<c02fcf52>] __sock_create+0x128/0x1dd
 [<c02fded1>] sys_socketcall+0xb9/0x292
 [<c0103a51>] syscall_call+0x7/0xb
fing is a copy of ping renamed.
The URL below describes what sounds like the same problem. It sounds like it is fixed in 2.6.12rc5-mm1, but not in mainline/vanilla.
http://www.uwsg.iu.edu/hypermail/linux/kernel/0505.3/0631.html
The url above seems to not relate after all. I have found that disabling the script I will attach to this bug report works around the problem. The script sets up multiple default gateways and uses them round robin style. It also sets up same in/same out. From my review of the oops above so far it seems to relate to the multiple default gateways. I plan to do further analysis.
Created attachment 115916: policy routing init script
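The attachment itself is not reproduced in this thread. For readers unfamiliar with this kind of setup, here is a minimal sketch of what "same in/same out" plus a round-robin default route can look like with iproute2. The addresses, gateways, and table numbers (101, 102) are placeholders invented for illustration and are not taken from the attached script. The `run` helper is also hypothetical: it only prints the commands unless RUN=1 is set, so the sketch can be read and tried without touching the routing tables.

```shell
#!/bin/sh
# Sketch only: placeholder addresses and table numbers, not the attached script.
# Commands are printed instead of executed unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

setup_policy_routing() {
    if0=eth0; ip0=192.0.2.10;    gw0=192.0.2.1
    if1=eth1; ip1=198.51.100.10; gw1=198.51.100.1

    # Same in/same out: traffic sourced from a link's address is looked up
    # in that link's own table, so replies leave through the same link.
    run ip route add default via "$gw0" dev "$if0" table 101
    run ip route add default via "$gw1" dev "$if1" table 102
    run ip rule add from "$ip0" table 101
    run ip rule add from "$ip1" table 102

    # Multipath default route with equal weights: asks the kernel to spread
    # new outbound routes across both gateways.
    run ip route add default scope global \
        nexthop via "$gw0" dev "$if0" weight 1 \
        nexthop via "$gw1" dev "$if1" weight 1
}

setup_policy_routing
```

Run as is, it prints five `ip` commands. The last one, the multipath default route, is the part that exercises the kernel's handling of multiple default gateways.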
The solution seems to be to disable CONFIG_IP_ROUTE_MULTIPATH_CACHED. I recompiled 2.6.12-1.1392 with it disabled and it has been going for over an hour without an oops. That option is the major change in the code that handles multiple default routes between 2.6.11 and 2.6.12.
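For anyone wanting to check whether a given kernel build has the option turned on, a small sketch follows. It assumes the kernel config is installed as /boot/config-<version>, which is the usual Fedora layout but may differ elsewhere; the function name `check_multipath_cached` is made up for this example.

```shell
#!/bin/sh
# Report whether a kernel config file enables CONFIG_IP_ROUTE_MULTIPATH_CACHED.
# Usage: check_multipath_cached [config-file]
# Default path is an assumption (Fedora-style /boot/config-<running kernel>).
check_multipath_cached() {
    cfg="${1:-/boot/config-$(uname -r)}"
    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg"
        return 2
    fi
    # An enabled boolean option appears as NAME=y; a disabled one appears
    # as a "# NAME is not set" comment or is absent.
    if grep -q '^CONFIG_IP_ROUTE_MULTIPATH_CACHED=y' "$cfg"; then
        echo "enabled"
        return 0
    fi
    echo "disabled"
    return 1
}
```

Calling `check_multipath_cached` with no argument checks the running kernel; passing a path checks another build's config before installing it.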
IP_ROUTE_MULTIPATH_CACHED is known to be buggy, in fact very buggy. That's why it's marked "EXPERIMENTAL" with very big capital letters. It should not be enabled in any kernel shipped with a distribution. Reassigning to davej. David, please turn that off in our configs. If something is keyed on CONFIG_EXPERIMENTAL it really shouldn't be enabled by default.
Removed for next build. Thanks. There isn't actually a dependency on CONFIG_EXPERIMENTAL for that option; it's just marked as (EXPERIMENTAL) in the text.