Red Hat Bugzilla – Bug 149087
Kernel panic regression in 2.4.21-27.0.2.ELsmp
Last modified: 2007-11-30 17:07:06 EST
Description of problem:
"Kernel panic: Fatal exception" in "handle_mm_fault"

Version-Release number of selected component (if applicable):
2.4.21-27.0.2.ELsmp

How reproducible:
We have been running Exim 4.44 with Sophie antivirus and SpamAssassin, variously communicating over Unix sockets. This setup has been rock solid. We applied the kernel errata listed above yesterday (1750 17 Feb) and today at 1445 the server halted with a kernel panic at the console. Unfortunately the only access we have to the console is via graphical ILO management, and there is no scrollback on the remote console, so we only have part of the traceback; however it seems to indicate a double fault? Screenshot of the ILO console attached.

Steps to Reproduce:
1. Boot 2.4.21-27.0.2.ELsmp
2. Allow service to run
3. Crash

Actual results:
Kernel panic

Expected results:
Normal function

Additional info:
We've rolled back to 2.4.21-20.ELsmp on this box for the moment. This is our production mail service, one of 4 near-identical machines, all of which have been upgraded. The load on the machine was rising at the time of the crash, so possibly it's a load-induced crash? Hardware is a dual-processor HP DL380 G3 with fairly standard server-level components; kernel bootup output attached for the *crashing* kernel.
Created attachment 111210 [details] Screenshot of (remote) console showing kernel panic
Created attachment 111211 [details] Syslog output for a bootup with the crashing kernel
In handling a page fault for a user address in the "sophie" process, handle_mm_fault() calls pte_alloc_map() to find, or potentially create, the page table containing the pte_t for the faulting address, kmap() it, and add the offset to the pte_t entry within the kmap'd page table page, returning the temporary mapping in "pte":

handle_mm_fault()
  ...
  if (pmd) {
          pte_t * pte = pte_alloc_map(mm, pmd, address);
          if (pte)                                <<line 1935 in memory.c>>
                  return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
  }
  ...

"pte" was passed as an argument to handle_pte_fault(), which is inlined into handle_mm_fault(), and the instruction causing the fault is the attempted assignment of what pte points to into the "entry" variable:

static inline int handle_pte_fault(struct mm_struct *mm,
        struct vm_area_struct * vma, unsigned long address,
        int write_access, pte_t *pte, pmd_t *pmd)
{
        pte_t entry;

        entry = *pte;                             <<line 1876 in memory.c>>
        if (!pte_present(entry)) {
        ...

Line 1935 is the "if (pte)" in handle_mm_fault(); the oops occurred in the mov instruction below, at line 1876 in handle_pte_fault():

/usr/src/debug/kernel-2.4.21/linux-2.4.21/mm/memory.c: 1935
0xc0143bea <handle_mm_fault+0x7a>:  je     0xc0143dd0 <handle_mm_fault+0x260>
/usr/src/debug/kernel-2.4.21/linux-2.4.21/mm/memory.c: 1876
0xc0143bf0 <handle_mm_fault+0x80>:  mov    (%eax),%ecx

Although the oops message is clipped, I'm presuming that it indicated a faulting address of fffb8d20, since that's what's in EAX. That would be impossible as a kmap address, because the range of available kmap addresses is fixed:

#define PKMAP_BASE (0xff000000UL)
#define NR_SHARED_PMDS ((0xffffffff-PKMAP_BASE+1)/PMD_SIZE)
#ifdef CONFIG_X86_PAE
#define LAST_PKMAP 512
#else
#define LAST_PKMAP 1024
#endif
#define LAST_PKMAP_MASK (LAST_PKMAP-1)
#define PKMAP_NR(virt)  ((virt-PKMAP_BASE) >> PAGE_SHIFT)
#define PKMAP_ADDR(nr)  (PKMAP_BASE + ((nr) << PAGE_SHIFT))

With PAE turned on, as is the case in an smp kernel, there could be a maximum of 512 pages available to kmap(), starting at the PKMAP_BASE address of 0xff000000 and ranging up to a maximum of PKMAP_ADDR(511), or ff1ff000. A kmap address of fffb8d20 is way out of range.

That all being said, it's impossible to determine how that happened with a simple oops display. We would either need to be given a reproducer, or have you reproduce it with either netdump or diskdump enabled, and hope that a dumpfile would afford us additional clues, in order to have a prayer of debugging this any further. Would either of those two options be amenable to you?
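For anyone who wants to reproduce the range arithmetic above, here is a minimal user-space sketch (ordinary C, not kernel code) built from the PKMAP_* constants quoted from the headers; the faulting address is the one presumed to be in EAX:

#include <stdio.h>

#define PAGE_SHIFT      12
#define PKMAP_BASE      0xff000000UL
#define LAST_PKMAP      512                        /* CONFIG_X86_PAE */
#define PKMAP_ADDR(nr)  (PKMAP_BASE + ((unsigned long)(nr) << PAGE_SHIFT))

int main(void)
{
        unsigned long fault = 0xfffb8d20UL;        /* value presumed to be in EAX */
        unsigned long top   = PKMAP_ADDR(LAST_PKMAP - 1);

        printf("kmap range: %08lx - %08lx\n", (unsigned long)PKMAP_BASE, top);
        printf("fault addr: %08lx is %s\n", fault,
               (fault >= PKMAP_BASE && fault <= top) ? "in range" : "out of range");
        return 0;
}

Compiled and run, this prints a kmap range of ff000000 - ff1ff000 and reports fffb8d20 as out of range, matching the reasoning above.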
Created attachment 111676 [details] Screenshot of console during panic
I believe I've seen this as well -- I'm attaching a screenshot of the console with the panic. I've also configured a crontab with a battery of echo "foo" > /proc/sysrq-trigger commands to get more meat on this. Unfortunately, with netdump configured, my systems seem to reboot automagically while trying to send a vmcore over -- the most I get is some oops info. I'm running the sysrq's every 2 minutes to try to capture the state of the system right before it happens. If this happens again and I get anything useful out of it, I'll attach it here as well. I have a group of systems panicking every couple of days or weeks with 3 or 4 different sets of call traces, all running U4. The only common factor is that these systems are heavy NFS clients and run large java JVMs (i.e. multiple 2.3GB processes). -charles.
Take a look in your netdump-server's /var/log/messages for the reason why it sent the reboot request to the panicking client. Once a panicking client starts a netdump operation, it will only reboot automatically upon receiving a command to do so from the netdump-server.
Also, even though the vmcore file wasn't created, you mention that there is the log file on the netdump-server. In that log file, you should see the same output as the screenshot of the console. What does the line just above the "CPU: 1" indicate? It should be something of the sort: kernel BUG in header file at line xxx
In this particular instance, this host was not configured as a netdump client, although now it is. If it panics again, I'll have the output. I do have netdump logs of the 2 other systems and will attach. I find it too coincidental that 3 machines in the same group panic with different call traces running the same jobs. This is all heavy NFS / large java jvm process / perl working with large datasets. -charles.
Gotcha -- thanks... Further note -- the out_of_line() reference made me realize that my ramblings about the mapped addresses starting at PKMAP_BASE and going up to PKMAP_ADDR(511) are incorrect. pte_alloc_map() uses kmap_atomic(), which uses a fixed set of specific-use, per-CPU PTEs occupying the fixed address range (by my calculations) from ff02a000 up to ff469000. Actually only NR_CPUS' worth of pages in that range are used by pte_alloc_map(), but it's the base-to-top values that would still exclude fffb8d20.
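To illustrate how those per-CPU kmap_atomic() slots are laid out, here is a minimal sketch of the fixmap address derivation. Note that FIXADDR_TOP, FIX_KMAP_BEGIN, KM_TYPE_NR, and the slot number used below are placeholder values for illustration only, not the actual RHEL3 constants (those live in include/asm-i386/fixmap.h and kmap_types.h):

#include <stdio.h>

#define PAGE_SHIFT        12
/* Placeholder values for illustration only. */
#define FIXADDR_TOP       0xffffe000UL
#define FIX_KMAP_BEGIN    64
#define KM_TYPE_NR        16
#define __fix_to_virt(x)  (FIXADDR_TOP - ((unsigned long)(x) << PAGE_SHIFT))

int main(void)
{
        int type = 5;   /* hypothetical km_type slot, standing in for the one pte_alloc_map() uses */
        int cpu;

        /* fixmap entries grow downward from FIXADDR_TOP, so higher CPU
         * numbers land at lower virtual addresses */
        for (cpu = 0; cpu < 2; cpu++) {
                int idx = type + KM_TYPE_NR * cpu;
                printf("cpu %d: kmap_atomic slot at %08lx\n",
                       cpu, __fix_to_virt(FIX_KMAP_BEGIN + idx));
        }
        return 0;
}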
Created attachment 111679 [details] netdump log of "qt8" crashing
Created attachment 111680 [details] netdump log of "qt9" panicking
Those two look to be completely unrelated to the handle_mm_fault issue we're trying to track down here. In fact, they look suspiciously like the issue in bugzilla #132838. In that bugzilla, comment #61 has pointers to test kernels that may address the problems seen in qt8 and qt9. However, if your system is not accessing /proc/kcore (perhaps when doing a tar of the whole system from /), then it's not going to be the same issue. In any case, let's keep this case specific to panics occurring from a user-space do_page_fault() leading to handle_mm_fault(), with the panic occurring there.
Sure thing -- I'll append more data the next time I see the handle_mm_fault() panic. I've got my netdump clients waiting. I read through the other bugzilla for more details, but I can't think of any scenario where tar would be accessing /proc/kcore on these systems. I merely included those two netdump logs because I thought it was too coincidental that 3 boxes in the same group, running the same jobs, were crashing within days of each other. Perhaps the servers are just having more than one issue manifest itself here. -charles.
Ok, this didn't take long -- I got another panic with handle_mm_fault() in the call trace. Unfortunately, no vmcore was transferred, the console shows "tg3: transmission timeout" or some such. I'm appending the screenshot, the netdump log file, and the /var/adm/messages with a battery of sysrq-triggers every 2 minutes. -charles.
Created attachment 111700 [details] screenshot of "qt10" crash
Created attachment 111701 [details] netdump log of "qt10" shortly preceding crash

The last log sent to netdump doesn't seem to actually contain the panic & call trace. From the screen shot of the crash, you can see the tg3 error following the call trace, complaining about a timeout. As it stands, I can't get a vmcore or even much of any oops info during the actual crash. -charles.
Created attachment 111702 [details] netdump log of "qt10" shortly preceding crash
Created attachment 111703 [details] messages file from "qt10"

Here's a messages file corresponding to the following cronjob:

[root@qt9 ccf]# crontab -l
*/2 * * * * /bin/echo "m" > /proc/sysrq-trigger
*/2 * * * * /bin/echo "t" > /proc/sysrq-trigger
*/2 * * * * /bin/echo "w" > /proc/sysrq-trigger
*/2 * * * * /bin/echo "p" > /proc/sysrq-trigger

Not sure if it'll help, but I'm trying to gather as much debugging data as possible. -charles.
Right -- there's no oops message anywhere here to work with. The "handle_mm_fault" seen in the "echo" process trace in the "qt10" attachment appears to be a stale leftover return address on the stack, and the process 8185 has successfully made it to schedule(). The abundance of sysrq-trigger writes looks to be overwhelming the tg3, and possibly keeping the subsequent netdump operation from proceeding. Just let the system run normally with netdump enabled, and wait for one of the real handle_mm_fault faults to occur.
Ok, I've disabled all the cronjobs and made sure netdump was running on the 3 crashing systems. Is there anything else I can do to ensure that a vmcore gets transferred? I haven't been able to get a single vmcore transferred yet -- I get the occasional vmcore-incomplete file. -charles.
If the vmcore-incomplete files are only 4k in length, then packets being sent back from the netdump-server to the panicking client are being dropped. If the file is anywhere from more than 4k up to ~1GB in length, then it's very possible that the vmcore-incomplete file can still be analyzed. The /var/log/messages file on the netdump-server will have error messages from the netdump-server daemon indicating why it bailed out.
Unless I'm mistaken, the files are 0 bytes in size. I also checked the messages file and see no netdump log entries other than when clients connect:

[root@master 03]# grep netdump /var/log/messages
Mar 7 10:08:03 master sshd[30847]: pam_krb5: authentication fails for `netdump'
Mar 7 10:08:03 master sshd[30847]: Accepted password for netdump from 65.202.163.175 port 33354
Mar 7 10:08:03 master sshd(pam_unix)[30883]: session opened for user netdump by (uid=34)
Mar 7 10:08:03 master sshd(pam_unix)[30883]: session closed for user netdump
Mar 7 10:08:38 master sshd[31005]: pam_krb5: authentication fails for `netdump'
Mar 7 10:08:38 master sshd[31005]: Accepted password for netdump from 65.202.163.174 port 32789
Mar 7 10:08:38 master sshd(pam_unix)[31014]: session opened for user netdump by (uid=34)
Mar 7 10:08:38 master sshd(pam_unix)[31014]: session closed for user netdump
Mar 7 10:24:05 master sshd[3405]: pam_krb5: authentication fails for `netdump'
Mar 7 10:24:05 master sshd[3405]: Accepted password for netdump from 65.202.163.159 port 32798
Mar 7 10:24:05 master sshd(pam_unix)[3418]: session opened for user netdump by (uid=34)
Mar 7 10:24:05 master sshd(pam_unix)[3418]: session closed for user netdump

-charles.
Some more info... here's dmesg on a client:

netlog: using network device <eth0>
netlog: using source IP 65.202.163.159
netlog: using source UDP port: 6666
netlog: using netdump target IP 65.202.162.18
netlog: using netlog target IP 65.202.162.18
netlog: using target UDP port: 6666
netlog: using target UDP port: 6666
netlog: using netdump target ethernet address 00:0e:7f:fe:df:4f.
netlog: using netdump target ethernet address 00:0e:7f:fe:df:4f.
netlog: using broadcast ethernet frames to send syslog packets.
netlog: network logging started up successfully!

but here's what /var/crash/65.202.163.159-2005-03-07-10:50/log says:

[...network console shutdown...]

Is this normal?
I'm not sure of the timeline here. If you got a "network console shutdown" message from 65.202.163.159, then the netconsole module on that client was explicitly shut down during a module unload:

static void cleanup_netconsole(void)
{
        printk(KERN_INFO "netlog: network logging shut down.\n");
        unregister_console(&netconsole);

#define SHUTDOWN_MSG "[...network console shutdown...]\n"
        write_netconsole_msg(NULL, SHUTDOWN_MSG, strlen(SHUTDOWN_MSG));

        dev_put(netconsole_dev);
        netconsole_dev = NULL;
        netdump_unregister_hooks();
}

The write_netconsole_msg() would have sent the message that was put into the remote log file, but there should also have been a "netlog: network logging shut down." message on the client, or at least in the client's /var/log/messages.
This is strange -- [...network console shutdown...] shows up immediately after a service netdump start or restart. I can reproduce this consistently. dmesg on the client shows a successful startup, and the server immediately shows a console shutdown message, i.e.:

[root@qt9 root]# date
Mon Mar 7 11:23:58 EST 2005
[root@qt9 root]# service netdump restart
disabling netdump                  [ OK ]
netdump@65.202.162.18's password:
initializing netdump               [ OK ]
[root@qt9 root]# date
Mon Mar 7 11:24:11 EST 2005

[root@master crash]# ls -l
total 24
drwx------  2 netdump netdump 4096 Mar 7 10:50 65.202.163.159-2005-03-07-10:50
drwx------  2 netdump netdump 4096 Mar 7 11:19 65.202.163.159-2005-03-07-11:19
drwx------  2 netdump netdump 4096 Mar 7 11:20 65.202.163.174-2005-03-07-11:20
drwx------  2 netdump netdump 4096 Mar 7 11:24 65.202.163.174-2005-03-07-11:24
drwx------  2 netdump netdump 4096 Mar 4 15:52 magic
drwxr-xr-x  2 netdump netdump 4096 Jun 17  2004 scripts
[root@master crash]# cat 65.202.163.174-2005-03-07-11:24/log
[...network console shutdown...]
[root@master crash]#

[root@qt9 root]# lsmod | grep netconsole
netconsole             16268   0 (unused)

-charles.
Never mind, this is just a goof on my part. I did a restart when I thought I had done a start, which had me thinking it was shutting down immediately after starting, when it was really just logging the shutdown that preceded the startup. :-) -charles.
An update to my previous updates... My claim that the kmap'd PTE address of fffb8d20 (as seen in attachment 111210 [details]) is out of range was incorrect. That address absolutely is valid for a fault occurring on cpu 0, in which case the temporary kmap'ing of page table pages would be at fffb8000. (My two previous calculations were done incorrectly.)

That being said, it is not clear why the reference to it would cause an oops, especially considering that it was just mapped, and there are no places where it could have blocked.

The second handle_mm_fault() in attachment 111676 [details] is obviously related, but again it's not clear how it happened. In that case, the out_of_line() call was made after the kmap'd vaddr was calculated, where it verifies that the PTE slot it is about to set up for that vaddr is still empty, and calls out_of_line_bug() if it is not, i.e., if !pte_none():

static inline void *__kmap_atomic(struct page *page, enum km_type type)
{
        enum fixed_addresses idx;
        unsigned long vaddr;

        idx = type + KM_TYPE_NR*smp_processor_id();
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#if HIGHMEM_DEBUG
        if (!pte_none(*(kmap_pte-idx)))
                out_of_line_bug();
#else
        ...

The PTE *should* have been cleared by the prior user. Again, there's no obvious explanation.
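For context, the teardown path is the piece that is supposed to clear that slot. Roughly (this is paraphrased from the 2.4 i386 highmem header, so treat it as a sketch rather than the exact RHEL3 source), the matching kunmap_atomic() under HIGHMEM_DEBUG looks like:

static inline void kunmap_atomic(void *kvaddr, enum km_type type)
{
#if HIGHMEM_DEBUG
        unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
        enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id();

        if (vaddr < FIXADDR_START)      /* not a kmap_atomic mapping */
                return;

        if (vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx))
                out_of_line_bug();

        /*
         * Clear the slot so that any later user that skips the remap
         * will oops instead of silently reusing a stale mapping.
         */
        pte_clear(kmap_pte-idx);
        __flush_tlb_one(vaddr);
#endif
}

So the !pte_none() check in __kmap_atomic() is catching a slot that the previous user never released, which is why the oops there is so surprising.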
Does the hardware in this case have greater than 4GB of memory?
All of my affected systems have >4GB -- most are 8GB, some are 6GB. Thanks, -charles.
There's a very strong possibility that the corruption issue that you are seeing will be resolved in the upcoming RHEL3-U5 kernel. The fix is being tested at several customer sites at this time; an official Red Hat kernel will be made available when the fix is confirmed.

The problem has only been reproduced on systems:

1. having > 4GB of memory
2. running RHEL3-U4 (2.4.21-27) or later
3. typically running large numbers of java and/or oracle processes with large virtual address spaces

In those situations, user data from a java process gets written to essentially random locations in the kernel data section, such as in the mem_map[] array, the dentry_hashtable[] array, the page_hash_table[] array, or in any number of different memory slab pages. The eventual crash/oops has therefore resulted in unrelated back traces, because a random, innocent kernel data structure has been corrupted.

While it cannot be absolutely pinned on that particular bug without a netdump/diskdump to view the corruption around the kernel data which was innocently corrupted, if the requirements above match the systems exhibiting the problem, there's a better than good chance that this is the problem.
I have been running the -31 beta kernel on some of these hosts for a few weeks now without a single crash. I'm remaining hopeful so far. :-) -charles.
Well, actually the -31 beta kernel doesn't have the fix I'm referring to, but there are several other potential corruptor fixes in there as well.
Oh, I see. Is there any rough estimate as to when the U5 kernel will be available? I'd like to provide some visibility to my user-base as to when we can potentially have this resolved on all systems. Thanks, -charles.
It's soon, but I'll have to defer that question to the RHEL3 maintainer (Ernie Petrides) who is on the cc list...
The current U5 beta kernel is version 2.4.21-32.EL, and it is currently scheduled for official release on 18-May-2005 (next Wednesday).
My mistake -- the current RHEL3-U5 candidate (2.4.21-32.EL) does not have the fix I referred to in comment #32.
Hello, Phil and Charles. We have finally resolved the problem of a very obscure PTE race condition that could cause arbitrary memory corruption (in either user or kernel data) on SMP x86 systems with greater than 4G of memory (with either the smp or hugemem configs). We have already verified definitively that our fix corrects two open bugzilla reports (151865 and 156023), and we are very interested in whether this fix resolves the data corruption problems you have encountered.

The fix has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.4.EL). Although this is just an interim Engineering build for U6 development, I have made the following kernel RPMs available on my "people" page for you to test:

http://people.redhat.com/~petrides/.pte_race/kernel-hugemem-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-hugemem-unsupported-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-smp-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-smp-unsupported-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-source-2.4.21-32.4.EL.i386.rpm

We intend to incorporate this bug fix in our next post-U5 security errata release (which has not yet been built, but which is likely to enter formal Q/A as soon as U5 is shipped on Wednesday, May 18th). In the meantime, please test this interim U6 build as soon as is practical so that we can determine if the fix addresses this bugzilla. Thanks in advance. -ernie
The fix (referred to above) for a data corruption problem has also just been committed to the RHEL3 E6 patch pool (in kernel version 2.4.21-32.0.1.EL). I'm tentatively moving this BZ to MODIFIED state and will list it in the associated security advisory that we intend to release post-U5. Please still follow up with test results as requested in comment #43. Thanks. -ernie
Thanks, Ernie -- we'll load this up and test. It sounds like this issue may also be responsible for the random JDK core dumps we experience on our x86 systems with >4GB RAM. I've had a tough time keeping our developers from crawling back to their Solaris boxes, between the panics and JDK core dumps. We don't really see the issue on x86_64 systems, which makes sense, apparently. Thanks, -charles.
Charles, I've added a detailed explanation of the problem in bug 141394 comments 129 and 130. The info you just provided correlates perfectly with our understanding of the bug, so I'm very hopeful that our fix will resolve the problems you encountered. The release of the 2.4.21-32.0.1.EL kernel is tentatively targeted for the end of next week. (The -32.4.EL pre-U6 kernel I've made available to you is a strict superset of the -32.0.1.EL security errata kernel.)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-472.html
Does this issue also apply to x86_64 systems? Thanks, -charles.
No -- it's only on 32-bit x86 machines with PAE enabled, which means that they utilize 64-bit PTEs. x86_64 machines update their 64-bit PTEs atomically with one memory access.
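To sketch why the PTE width matters: under PAE a pte_t is 64 bits wide but the CPU's natural store is 32 bits, so setting a PTE takes (at least) two separate stores, and that is the kind of window the race fix closes. The snippet below is an illustration only, not the actual RHEL3 set_pte() implementation; the pte_low/pte_high split follows the i386 PAE page-table layout.

/* Illustration only: a PAE-style PTE update on a 32-bit CPU. */
typedef struct { unsigned long pte_low, pte_high; } pte_t_pae;   /* two 32-bit halves on i386 */

static inline void pae_style_set_pte(pte_t_pae *ptep, pte_t_pae pte)
{
        ptep->pte_high = pte.pte_high;            /* store #1: upper 32 bits            */
        __asm__ __volatile__("" ::: "memory");    /* keep the two halves ordered        */
        ptep->pte_low  = pte.pte_low;             /* store #2: the present bit is here  */
}

/* A CPU walking the page tables between store #1 and store #2 can see a
 * half-written entry.  On x86_64 the equivalent update is a single
 * naturally aligned 64-bit store, so a concurrent walker always sees either
 * the old entry or the new one, never a mix:
 *
 *      *(volatile unsigned long long *)ptep = new_pte;
 */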