From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.7.3) Gecko/20041001 Firefox/0.10.1

Description of problem:
Running on an HP DL140 with dual 2.4GHz Xeons and 1GB of ECC DDR. This server operates as a PPTP concentrator running the PoPToP server (1.2.1) along with pppd 2.4.3. We have tried this system using both the onboard Broadcom gigabit NICs and a dual Intel EEPro 100. Usually within 24 hours of bootup, the following oops occurs:

kernel BUG at mm/prio_tree.c:377!
invalid operand: 0000 [#1]
SMP
Modules linked in: ipt_LOG(U) sch_tbf(U) ppp_mppe(U) ppp_async(U) crc_ccitt(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U) ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_conntrack(U) ip_tables(U) md5(U) ipv6(U) sunrpc(U) e100(U) mii(U) sg(U) scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U) battery(U) asus_acpi(U) ac(U) ext3(U) jbd(U)
CPU:    1
EIP:    0060:[<021425de>]    Tainted: P
EFLAGS: 00010202   (2.6.8-1.521custom)
EIP is at prio_tree_right+0x85/0xc5
eax: 00000009   ebx: 0cf1acf8   ecx: 00000000   edx: 12da3d00
esi: 00000000   edi: 00000004   ebp: 404a6d78   esp: 0cf1ac90
ds: 007b   es: 007b   ss: 0068
Process yum (pid: 24194, threadinfo=0cf1a000 task=12e4ecb0)
Stack: 0cf1acf8 00000004 00000004 404a6d78 021427ae 00000004 0cf1acb0 0cf1acb4
       00000000 00000043 0cf1acf8 404a6d78 00000004 08ec1ac4 02142968 00000004
       0000007b 404a6d54 034fac80 02150cf7 00000004 00000004 00000004 00000001
Call Trace:
 [<021427ae>] prio_tree_next+0x89/0x9b
 [<02142968>] vma_prio_tree_next+0x4b/0x63
 [<02150cf7>] page_referenced+0x14d/0x18d
 [<021478cd>] refill_inactive_zone+0x245/0x6a0
 [<0211b29e>] activate_task+0x86/0x93
 [<02147db5>] shrink_zone+0x8d/0xb4
 [<02147e1f>] shrink_caches+0x43/0x4e
 [<02147edd>] try_to_free_pages+0xb3/0x16c
 [<02140369>] __alloc_pages+0x1c8/0x2be
 [<0214bd83>] do_anonymous_page+0xb6/0x241
 [<0214bf77>] do_no_page+0x69/0x3a0
 [<0214c460>] handle_mm_fault+0xdf/0x1d4
 [<0211955b>] do_page_fault+0x17c/0x58b
 [<0214e81d>] unmap_vma_list+0xe/0x17
 [<0214ebd5>] do_munmap+0x17a/0x186
 [<0214fcef>] move_page_tables+0x3f/0x4c
 [<0214fded>] move_vma+0xf1/0x175
 [<0215017a>] do_mremap+0x309/0x32c
 [<021193df>] do_page_fault+0x0/0x58b
Code: 0f 0b 79 01 cf fa 2e 02 39 52 04 74 08 0f 0b 7a 01 cf fa 2e

The system continues to function for approximately another minute. I see messages such as the following on the console repeatedly:

dst cache overflow

Eventually the system becomes completely unresponsive. When I hit the power button, ACPI tries to power down the system, but it hangs after killing a few processes and I must hard reset it.

I do not think this is bad hardware: we have approximately 11 DL140s and this happens on all of them, although more quickly on the ones with higher user load (network traffic, CPU usage, etc).

Version-Release number of selected component (if applicable):
kernel-smp-2.6.8-1.521

How reproducible:
Always

Steps to Reproduce:
1. Boot system.
2. Allow users to connect.
3. Wait up to 24 hours.

Expected Results:
System should not crash. :)

Additional info:
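Each call trace entry in the oops above has the form [<address>] symbol+0xoffset/0xsize, where the offset is how far into the function the address falls and the size is the function's length in bytes. As a reading aid only (it is not part of the original report), here is a minimal Python sketch that splits such entries into their fields. The names TRACE_ENTRY and parse_trace are hypothetical, and the sample is just the first three frames copied from the trace.

```python
import re

# Matches one oops call-trace entry such as "[<021427ae>] prio_tree_next+0x89/0x9b".
TRACE_ENTRY = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_trace(text):
    """Return (address, symbol, offset, size) tuples for each trace entry in text."""
    return [
        (int(addr, 16), sym, int(off, 16), int(size, 16))
        for addr, sym, off, size in TRACE_ENTRY.findall(text)
    ]


# First three frames from the oops in this report.
sample = """
[<021427ae>] prio_tree_next+0x89/0x9b
[<02142968>] vma_prio_tree_next+0x4b/0x63
[<02150cf7>] page_referenced+0x14d/0x18d
"""

frames = parse_trace(sample)
for addr, sym, off, size in frames:
    print(f"{sym}: address {addr:#010x}, {off} bytes into a {size}-byte function")
```

Read top to bottom, the frames run from the innermost function (where the BUG check fired) outward to its callers.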
A pretty good sign that the box is becoming unstable is that ntpd starts going haywire. Initially it starts up fine and I can use ntpdc -p to query it. However, if I shut down ntpd and then try to start it back up again after a period of time, it segfaults:

[root@chico-pptp1 20041103]# ntpd -d
ntpd 4.2.0 Thu Mar 11 11:46:39 EST 2004 (1)
addto_syslog: ntpd 4.2.0 Thu Mar 11 11:46:39 EST 2004 (1)
addto_syslog: signal_no_reset: signal 13 had flags 4000000
addto_syslog: precision = 5.000 usec
create_sockets(123)
addto_syslog: no IPv6 interfaces found
Segmentation fault

strace tells me:

write(1, "addto_syslog: no IPv6 interfaces"..., 39addto_syslog: no IPv6 interfaces found
) = 39
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
ioctl(4, SIOCGIFCONF, 0x8506018) = 0
ioctl(4, SIOCGIFCONF, 0x8506018) = 0
ioctl(4, SIOCGIFCONF, 0x8506018) = 0
ioctl(4, SIOCGIFCONF, 0x8506018) = 0
ioctl(4, SIOCGIFFLAGS, 0xfef39c30) = 0
ioctl(4, SIOCGIFNETMASK, 0xfef39c30) = 0
<repeats many times then process is killed>
Can you reproduce this without the binary module loaded?
Please excuse the dumb question... :) Which module would you like me to unload?
I see, you are referring to whatever is tainting the kernel. It's the ppp_mppe module which comes from the pppd sources. I can't disable this module on the server because our PPTP clients use it to connect... maybe I can set up another system with this exact configuration and just not allow clients to connect--but then the high network load scenario is not present.
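For context on the "Tainted: P" field in the oops: the first character after "Tainted:" is P when a module without a GPL-compatible license has been loaded, and G otherwise. A minimal Python sketch of pulling that field out of an oops line follows. The function name taint_flags is hypothetical, and the sample line is a re-spaced copy of the EIP line from the report. The sketch only reports the flag; it does not identify which module set it.

```python
import re


def taint_flags(oops_line):
    """Return the flag token that follows 'Tainted:' in an oops line, or None."""
    match = re.search(r"Tainted:\s*(\S+)", oops_line)
    return match.group(1) if match else None


# EIP line from the oops in this report.
line = "EIP:    0060:[<021425de>]    Tainted: P"
flags = taint_flags(line)

# A leading "P" means a non-GPL-compatible module was loaded at some point.
proprietary = flags is not None and flags.startswith("P")
print(flags, proprietary)
```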
I am in the process of upgrading to kernel 2.6.9-1.1_FC2 (under testing) and am applying the following patch, which was suggested to me on the LKML: http://marc.theaimsgroup.com/?l=linux-kernel&m=109926628920398&q=raw
The patch listed above *seems* to have fixed the prio_tree error I was getting. The system made it three days without crashing this time. It did lock up, but not with the prio_tree error that was occurring regularly before, so tentatively I'd say this bug is cleared up. If you're curious, you can read the details of the new errors here: http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/0297.html
Fedora Core 2 has now reached end of life, and no further updates will be provided by Red Hat. The Fedora Legacy project will be producing further kernel updates for security problems only. If this bug has not been fixed in the latest Fedora Core 2 update kernel, please try to reproduce it under Fedora Core 3, and reopen if necessary, changing the product version accordingly. Thank you.