From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020408

Description of problem:
I am using kernel-2.4.18-4 (compiled for i686-redhat-linux, as reported by "rpm -q kernel --qf '%{PLATFORM}\n'"). After a variable uptime (anything from a few minutes to many hours), the kernel oopses in the pte_chain_alloc function (in the mm/rmap.c file). There does not seem to be anything specific causing this -- it tends to simply occur without warning (usually while the computer is actively being used locally, rather than remotely).

The oopses seem to occur at line 356 in rmap.c. It appears that the pte_chain_freelist pointer is being set to some bogus value (often, but not always, a nice round number such as 0x54000000 or 0x14000000).

I'll attach a sample oops (passed through ksymoops), as well as output from dmesg, lspci, and "cat /proc/cpuinfo".

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1. Boot with 2.4.18-4 kernel
2. Do stuff -- oops usually occurs within a few hours of operation
3.

Actual Results: Kernel oopses in pte_chain_alloc

Expected Results: Kernel shouldn't oops :-)

Additional info:
Created attachment 58009 [details] Oops in pte_chain_alloc (processed by ksymoops)
Created attachment 58010 [details] dmesg output (from boot immediately after the oops)
Created attachment 58011 [details] lspci output
Created attachment 58013 [details] cat /proc/cpuinfo
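For readers unfamiliar with the failure mode described in the report, here is a minimal user-space sketch of a free-list allocator. It is not the code from mm/rmap.c. It assumes only what the name pte_chain_freelist suggests, namely that allocation pops the head of a singly linked list; `struct node`, `freelist_push` and `freelist_pop` are hypothetical names made up for illustration. It shows why a bogus value in the head pointer would fault on the very next allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative node type only; the real struct pte_chain is laid out differently. */
struct node {
    struct node *next;
    int payload;
};

/* Stands in for pte_chain_freelist: head of the list of free nodes. */
static struct node *freelist;

/* Return a node to the free list. */
void freelist_push(struct node *n)
{
    n->next = freelist;
    freelist = n;
}

/* Take a node from the free list, or return NULL if the list is empty. */
struct node *freelist_pop(void)
{
    struct node *n = freelist;

    if (n == NULL)
        return NULL;

    /*
     * This is the dangerous step: if `freelist` had been overwritten with
     * an arbitrary address, reading n->next dereferences that address and
     * faults (in the kernel, an oops).
     */
    freelist = n->next;
    n->next = NULL;
    return n;
}
```

In a structure like this the pop itself does nothing wrong; whatever corrupted the head pointer would have happened earlier and possibly somewhere else, which would be consistent with the oops always landing in the same place regardless of what the machine was doing.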
I've tried narrowing down this problem a bit further. Firstly, I've run memtest86 for over 12 hours and it hasn't reported a single problem. Secondly, I've run cpuburn for about an hour and a half, and it didn't report any problems either.

I also wrote a small program, crash.c (I'll attach this), which deliberately tries to fragment memory as much as possible, hopefully causing the kernel to oops. It seems to work pretty well -- using this program I am able to oops the 2.4.18-4 kernel even while I'm in runlevel 1 with no other programs running and no modules loaded except ext3 and jbd. This happens both with swap turned on and with swap turned off. The oopses are always on the same line in rmap.c.

I have also compiled kernel 2.4.16. I used Red Hat's .config file for my architecture so that the configuration was as close to the 2.4.18-4 configuration as possible. With this kernel, I can't get crash.c to oops the kernel. However, in the two days I've been running it, X has crashed once. This could be due to something completely different -- I'll try to narrow it down further.

I really hope to get the 2.4.18 kernel working soon -- in particular, it's got a later version of the drm modules, which X 4.2.0 needs to give me hardware acceleration.
Created attachment 58845 [details] Fragment memory
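The attached crash.c is not reproduced in this thread. As a rough idea of what a memory-fragmenting stress test can look like, the sketch below maps many separate one-page anonymous regions, touches each so that a physical page is actually faulted in, and then unmaps every other one to leave holes. The function names `fragment_once` and `fragment_rounds` and the page counts are assumptions made up for this example; it is not the attached program and is not claimed to trigger the oops.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `npages` one-page anonymous regions, touch each,
 * unmap the odd-numbered ones first (leaving holes), then unmap the rest.
 * Returns the number of pages mapped and touched, or -1 on setup failure.
 */
long fragment_once(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    if (npages == 0)
        return 0;

    char **slots = calloc(npages, sizeof *slots);
    if (slots == NULL)
        return -1;

    long touched = 0;
    for (size_t i = 0; i < npages; i++) {
        void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            continue;               /* slot stays NULL */
        slots[i] = p;
        slots[i][0] = (char)i;      /* write to the page so it is faulted in */
        touched++;
    }
    assert(touched <= (long)npages);

    /* Unmap every other page first so the remaining ones are surrounded by holes. */
    for (size_t i = 1; i < npages; i += 2) {
        if (slots[i] != NULL) {
            munmap(slots[i], (size_t)page);
            slots[i] = NULL;
        }
    }
    /* Then release what is left. */
    for (size_t i = 0; i < npages; i += 2) {
        if (slots[i] != NULL)
            munmap(slots[i], (size_t)page);
    }

    free(slots);
    return touched;
}

/* Repeat the pattern `rounds` times; returns total pages touched, or -1 on failure. */
long fragment_rounds(size_t npages, int rounds)
{
    long total = 0;
    for (int r = 0; r < rounds; r++) {
        long n = fragment_once(npages);
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}
```

A stress run would call something like `fragment_rounds(4096, 1000)` from `main`. Those numbers are arbitrary and would need tuning for the machine's memory size.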
I can confirm the problem is specifically with Rik van Riel's reverse-mapping patch. I have tested both the standard 2.4.18 kernel, and the same kernel with the latest applicable rmap patch (rmap12h). With the standard kernel there were no problems; with the patched kernel I am able to crash it very quickly (on exactly the same line in rmap.c). Since this is not really a Red Hat bug, I guess I ought to bring this to the kernel mailing list.
I ran this program for about two weeks on a 2GB 2x1GHz Xeon Dell server without seeing this problem, so I think there is some sort of hardware dependency here.