Bug 138040

Summary:	kernel BUG at mm/prio_tree.c:377!
Product:	[Fedora] Fedora	Reporter:	Ray Van Dolson <rayvd>
Component:	kernel	Assignee:	Dave Jones <davej>
Status:	CLOSED NEXTRELEASE	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	2	CC:	pfrields, wtogami
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-04-16 04:33:04 UTC	Type:	---

Description Ray Van Dolson 2004-11-03 23:20:18 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.7.3) Gecko/20041001
Firefox/0.10.1

Description of problem:
Running on an HP DL140 with dual 2.4GHz Xeons and 1GB of ECC DDR.

This server operates as a PPTP Concentrator running the PoPToP server
(1.2.1) along with pppd 2.4.3.  We have tried this system using both
the onboard Broadcom gigabit NICs as well as a dual Intel EEPro 100.

Usually within 24 hours of bootup, the following oops occurs:

kernel BUG at mm/prio_tree.c:377!
invalid operand: 0000 [#1]
SMP nntrack(U) ip_tables(U) md5(U) ipv6(U) sunrpc(U) e100(U) mii(U)
sg(U) scsi_mod(U) microcode(U) dm_mod(U) ohci_hcd(U) button(U)
battery(U) asus_acpi(U) ac(U) ext3(U) jbd(U)
Modules linked in: ipt_LOG(U) sch_tbf(U) ppp_mppe(U) ppp_async(U)
crc_ccitt(U) ppp_generic(U) slhc(U) ipt_limit(U) ipt_REJECT(U)
ipt_multiport(U) iptable_filter(U) iptable_nat(U) ip_co
CPU:    1
EIP:    0060:[<021425de>]    Tainted: P  
EFLAGS: 00010202   (2.6.8-1.521custom) 
EIP is at prio_tree_right+0x85/0xc5
eax: 00000009   ebx: 0cf1acf8   ecx: 00000000   edx: 12da3d00
esi: 00000000   edi: 00000004   ebp: 404a6d78   esp: 0cf1ac90
ds: 007b   es: 007b   ss: 0068
Process yum (pid: 24194, threadinfo=0cf1a000 task=12e4ecb0)
Stack: 0cf1acf8 00000004 00000004 404a6d78 021427ae 00000004 0cf1acb0
0cf1acb4 
       00000000 00000043 0cf1acf8 404a6d78 00000004 08ec1ac4 02142968
00000004 
       0000007b 404a6d54 034fac80 02150cf7 00000004 00000004 00000004
00000001 
Call Trace:
 [<021427ae>] prio_tree_next+0x89/0x9b
 [<02142968>] vma_prio_tree_next+0x4b/0x63
 [<02150cf7>] page_referenced+0x14d/0x18d
 [<021478cd>] refill_inactive_zone+0x245/0x6a0
 [<0211b29e>] activate_task+0x86/0x93
 [<02147db5>] shrink_zone+0x8d/0xb4
 [<02147e1f>] shrink_caches+0x43/0x4e
 [<02147edd>] try_to_free_pages+0xb3/0x16c
 [<02140369>] __alloc_pages+0x1c8/0x2be
 [<0214bd83>] do_anonymous_page+0xb6/0x241
 [<0214bf77>] do_no_page+0x69/0x3a0
 [<0214c460>] handle_mm_fault+0xdf/0x1d4
 [<0211955b>] do_page_fault+0x17c/0x58b
 [<0214e81d>] unmap_vma_list+0xe/0x17
 [<0214ebd5>] do_munmap+0x17a/0x186
 [<0214fcef>] move_page_tables+0x3f/0x4c
 [<0214fded>] move_vma+0xf1/0x175
 [<0215017a>] do_mremap+0x309/0x32c
 [<021193df>] do_page_fault+0x0/0x58b
Code: 0f 0b 79 01 cf fa 2e 02 39 52 04 74 08 0f 0b 7a 01 cf fa 2e 

The system continues to function for approximately another minute.  I
see messages such as the following on the console repeatedly:

dst cache overflow 

Eventually the system becomes completely unresponsive.  When I hit the
power button, ACPI tries to power down the system, but it hangs after
killing a few processes, and I must hard reset it.

I do not think this is bad hardware, as we have approximately 11
DL140s and this happens on all of them, although more quickly on
the ones with higher user load (network traffic, CPU usage, etc).

Version-Release number of selected component (if applicable):
kernel-smp-2.6.8-1.521

How reproducible:
Always

Steps to Reproduce:
1. Boot system.
2. Allow users to connect.
3. Wait up to 24 hours.
    

Expected Results:  System should not crash. :)

Additional info:

Comment 1 Ray Van Dolson 2004-11-04 00:03:57 UTC

A pretty good sign that the box is becoming unstable is that ntpd
starts going haywire.

Initially it starts up fine and I can use ntpdc -p to query it. 
However, if I shut down ntpd and then try to start it back up again
after a period of time, it segfaults:

[root@chico-pptp1 20041103]# ntpd -d
ntpd 4.2.0 Thu Mar 11 11:46:39 EST 2004 (1)
addto_syslog: ntpd 4.2.0 Thu Mar 11 11:46:39 EST 2004 (1)
addto_syslog: signal_no_reset: signal 13 had flags 4000000
addto_syslog: precision = 5.000 usec
create_sockets(123)
addto_syslog: no IPv6 interfaces found
Segmentation fault

strace tells me:
write(1, "addto_syslog: no IPv6 interfaces"..., 39addto_syslog: no
IPv6 interfaces found
) = 39
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
ioctl(4, SIOCGIFCONF, 0x8506018)        = 0
ioctl(4, SIOCGIFCONF, 0x8506018)        = 0
ioctl(4, SIOCGIFCONF, 0x8506018)        = 0
ioctl(4, SIOCGIFCONF, 0x8506018)        = 0
ioctl(4, SIOCGIFFLAGS, 0xfef39c30)      = 0
ioctl(4, SIOCGIFNETMASK, 0xfef39c30)    = 0

<repeats many times then process is killed>

Comment 2 Dave Jones 2004-11-04 01:15:20 UTC

Can you repeat this without the binary module loaded?

Comment 3 Ray Van Dolson 2004-11-04 01:40:06 UTC

Please excuse the dumb question... :)  Which module would you like me
to unload?

Comment 4 Ray Van Dolson 2004-11-04 16:20:07 UTC

I see, you are referring to whatever is tainting the kernel.

It's the ppp_mppe module which comes from the pppd sources.  I can't
disable this module on the server because our PPTP clients use it to
connect... maybe I can set up another system with this exact
configuration and just not allow clients to connect--but then the high
network load scenario is not present.
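
For reference, the taint shows up as the "Tainted: P" flag on the EIP
line of the oops; "P" is the flag the kernel sets once a module with a
non-GPL-compatible license has been loaded.  The following is only a
rough sketch for pulling the BUG site, the taint flags, and the call
chain out of an oops like the one in the description.  The function
name summarize_oops and the regular expressions are my own assumptions,
not part of any existing tool, and the sample text is a shortened copy
of the oops above.

```python
import re

# Shortened copy of the oops from the description (not the full log).
OOPS = """\
kernel BUG at mm/prio_tree.c:377!
invalid operand: 0000 [#1]
CPU:    1
EIP:    0060:[<021425de>]    Tainted: P
EFLAGS: 00010202   (2.6.8-1.521custom)
EIP is at prio_tree_right+0x85/0xc5
Call Trace:
 [<021427ae>] prio_tree_next+0x89/0x9b
 [<02142968>] vma_prio_tree_next+0x4b/0x63
 [<02150cf7>] page_referenced+0x14d/0x18d
"""


def summarize_oops(text):
    """Hypothetical helper: return (bug_site, taint_flags, eip_symbol, trace)."""
    # "kernel BUG at <file>:<line>!" names the failing BUG() check.
    bug = re.search(r"kernel BUG at (\S+?)!", text)
    # Taint flags follow "Tainted:"; an untainted kernel prints "Not tainted".
    taint = re.search(r"Tainted:[ \t]*([A-Z ]*)", text)
    # "EIP is at <symbol>+0x..." names the function that hit the BUG.
    eip = re.search(r"EIP is at (\w+)\+0x", text)
    # Call trace entries look like " [<address>] symbol+0xoff/0xlen".
    trace = re.findall(r"^\s*\[<[0-9a-f]+>\]\s+(\w+)\+0x", text, re.M)
    return (
        bug.group(1) if bug else None,
        taint.group(1).split() if taint else [],
        eip.group(1) if eip else None,
        trace,
    )


if __name__ == "__main__":
    print(summarize_oops(OOPS))
```

A non-empty taint list is what prompts the request to reproduce the
crash without the offending module loaded.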

Comment 5 Ray Van Dolson 2004-11-04 17:31:34 UTC

I am in the process of upgrading to kernel 2.6.9-1.1_FC2 under
testing, and am applying the following patch (suggested to me on the LKML):

http://marc.theaimsgroup.com/?l=linux-kernel&m=109926628920398&q=raw

Comment 6 Ray Van Dolson 2004-11-09 05:18:50 UTC

The patch listed above *seems* to have fixed the prio_tree error I was
getting.  The system made it three days without crashing this time. 
It did lock up, but not with the prio_tree error that was occurring
regularly before.

So tentatively I'd say this bug is cleared up.  You can read the
details of the new errors here if you're curious.

http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/0297.html

Comment 7 Dave Jones 2005-04-16 04:33:04 UTC

Fedora Core 2 has now reached end of life, and no further updates will be
provided by Red Hat.  The Fedora Legacy project will be producing further kernel
updates for security problems only.

If this bug has not been fixed in the latest Fedora Core 2 update kernel, please
try to reproduce it under Fedora Core 3, and reopen if necessary, changing the
product version accordingly.

Thank you.