Description of problem: Apparent memory corruption with kernel 2.4.21-25.ELhugemem (also occurred under 2.4.21-20.ELhugemem).

Version-Release number of selected component (if applicable): kernel-hugemem-2.4.21-25.EL

How reproducible: Occurs randomly.

Additional info:

RHEL3 kernels have exhibited a problem in which portions of memory appear to become corrupted. We first ran into this problem on 11/5/2004 on a system running 2.4.21-20.ELhugemem, and saw it again today on a different (identically configured) system running 2.4.21-25.EL. The servers in question host Oracle databases. These are the details for each event:

---------------------

11/5/2004: We began to see internal Oracle errors occurring on one datafile of a database. However, the Oracle DBVERIFY utility showed that there was no corruption in the datafile on disk. Reviewing the system logs showed the following error (which occurred just before we started seeing the Oracle internal errors):

Nov 5 07:55:42 servername kernel: memory.c:189: bad pmd d151aff8 (2f00000000000000).

This made us suspect either bad memory or a kernel bug, so we shut down the database and brought it back up on an identically configured server--which made the problem disappear. In retrospect it's possible that the "bad pmd" error was not the cause of the problem, but rather a result of it (or was unrelated).

11/30/2004: We saw an anomalous error from tripwire--two files were marked as modified, despite the fact that every ancillary check I could do on the files verified that they had NOT changed, and that they were still identical to the same two files on the companion system. Eventually this problem simply went away on its own. At about the same time we also began to see a problem with Oracle that was identical to the problem we'd seen on 11/5/2004--internal Oracle errors on one datafile in the database. In this case, though, the DBVERIFY utility *did* show corruption in the datafile in question. However, this datafile is hosted on NFS, and running DBVERIFY on the other database server against the exact same datafile showed NO errors. And after an hour or so, running DBVERIFY on the original server again showed no errors--so the error had "corrected" itself, just as the tripwire error had done. There were no apparent errors in the system logs related to this issue. Also, the system had been rebooted just two days before, and had been running the -25.EL kernel for about two weeks.

---------------------

So a few points about these problems:

1) They occurred on two different but identically-configured servers
2) They affected both Oracle and tripwire
3) They involved both NFS-hosted and local files
4) Manifestations of the problem went away without any action on our part (on 11/30/2004)
5) They occurred on two different versions of the RHEL3 kernel

This would seem to rule out the problem being caused by hardware or an Oracle bug, or being specifically related to either NFS or a local filesystem. It may be that the root problem is corruption of cached filesystem data, and so the resulting problems "correct" themselves when the filesystem data in question gets flushed from memory.

I'll attach output of AltSysrq-m, -w, and -t as well as /proc/slabinfo from the server on which we're currently seeing the problem. If you need additional information, please let me know ASAP, since we'll be moving the Oracle database to the other server tonight to clear this issue.
Created attachment 107655 [details] Output of AltSysrq-m, -w, and -t @ 11:44am on 11/30/2004
Created attachment 107656 [details] Contents of /proc/slabinfo @ 11:42am on 11/30/2004
John, are you using the e100 network driver? We have found some corruption related to that during the U4 beta and have rolled it back to the version in U3. As for the -20.EL kernel, I am aware of one VM bug (admittedly very obscure; I am not aware of anybody having triggered it outside of a stress test on a 16-way SMP system), which got fixed for U4.
Nope, not using e100 (but we are using e1000).
John, do you know if the corruption was in an NFS-mounted file or in a locally mounted file? The reason I am asking is that the corruption does not make it back to the file on disk, and the fact that it "corrected itself" is indicative of corruption in the pagecache. This would cause it to show up while the corrupted page was still in the cache, but once it was displaced and subsequently re-read the corruption would be gone. Larry Woodman
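One way to test the pagecache theory directly, while a file is still showing corruption, is to read the same block of the suspect file twice--once through the pagecache and once bypassing it--and compare the two copies. Below is a minimal sketch of that check (not something from this bug's attachments); it assumes the filesystem in question supports O_DIRECT, that a 4096-byte block satisfies its alignment rules, and that the requested block is not the short tail of the file:

/* Compare one 4096-byte block of a file as read through the pagecache
 * against the same block read with O_DIRECT (straight from disk or the
 * NFS server).  A mismatch points at corruption in the cached copy; a
 * match means cache and backing store agree.  Note the comparison is
 * only meaningful if the block is actually resident in the pagecache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK 4096

int main(int argc, char **argv)
{
        char cached[BLK];
        void *direct;
        off_t off;
        int fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <file> <block#>\n", argv[0]);
                return 1;
        }
        off = (off_t)atol(argv[2]) * BLK;

        /* normal read: satisfied from the pagecache if the data is resident */
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || pread(fd, cached, BLK, off) != BLK) {
                perror("cached read");
                return 1;
        }
        close(fd);

        /* O_DIRECT read: bypasses the pagecache, needs an aligned buffer */
        if (posix_memalign(&direct, BLK, BLK) != 0) {
                perror("posix_memalign");
                return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0 || pread(fd, direct, BLK, off) != BLK) {
                perror("O_DIRECT read");
                return 1;
        }
        close(fd);

        if (memcmp(cached, direct, BLK))
                printf("block %s: cached copy differs from on-disk copy\n", argv[2]);
        else
                printf("block %s: cached copy matches on-disk copy\n", argv[2]);
        free(direct);
        return 0;
}

Run against the NFS-mounted datafile on the client that is throwing the Oracle errors, a mismatch would point at the client's cache rather than the copy on the filer.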
I'd agree that pagecache corruption seems like a likely explanation-- that was our assumption. The incident this time around involved both an NFS-mounted file and two local files (as I mentioned in the longish initial comment). The first incident (on 11/5) involved only an NFS-mounted file, but that's not surprising on this system, since the database is NFS-mounted and is by far the lion's share of filesystem data in the working set. Also, with the local files, tripwire reported changes *only* in the CRC32 and MD5 properties--not any of the file times or other file metadata. Which would indicate that it was just a result of tripwire reading the corrupted data from the pagecache. And as I said, these tripwire errors just went away on their own after a few hours.
John, I am building a kernel with slab debugging turned on to see if we can catch someone using a kernel data structure after it's freed, or accessing beyond the bounds of a piece of kernel-allocated memory. The bad_pmd message with the extraneous "2f000..." doesn't sound very good either, so we might catch whoever did that as well. Are you willing to try out a kernel with slab debugging turned on? Also, how long did it take for these errors to show up? Is that version of Oracle using direct_IO or AIO? Larry
John, hugemem and smp kernels with slab debugging turned on are located here:

http://people.redhat.com/~lwoodman/.for_debug_purposes_only/

When you get a chance, please try to reproduce this corruption with both kernels so we can also determine whether the different address-space layout between 4G/4G and 3G/1G has anything to do with this problem as well. Thanks, Larry Woodman
John, another request: Is it possible to get the actual corruption that shows up from Oracle and tripwire? It's slightly possible that it will be obvious what caused the corruption that you are seeing. Larry
Some answers:

- We're not using direct or async I/O with this Oracle installation.

- The errors showed up after two days of uptime in this case, but before that these two systems had been running the database for 25 days on various kernels (-20 and higher) without any apparent corruption issues. Also, they'd been running on -20.EL for over a month before the corruption issue on 11/5/2004. We have no idea how to reproduce it--it's completely random.

- Also, it only happens on our production database server (with the hugemem kernel), so there's no way for us to try to reproduce it for you on an SMP kernel. In fact we're running the -25.ELsmp kernel on our dev/QA database servers, and we haven't seen the problem there yet, though it's possible that it's happened but wasn't noticed.

- Unfortunately I don't have any way at this point to get you copies of the actual corrupted data. If the tripwire issue comes up again that would be a great way to catch it, assuming I can get a copy of the corrupted version of the file before it gets flushed from the pagecache (as apparently happened this time).

Regarding the kernel with slab debugging: we're willing to try it, but what will be the effects of running that kernel? These are production database servers, so if the debug kernel would significantly affect performance or cause other operational issues we couldn't risk it.
Larry, you may have missed that last question: what will be the effects of running the test kernel? These are production database servers, so if the debug kernel would significantly affect performance or cause other operational issues we couldn't risk it. This Sunday is our window to put the new kernel in place, so if you could let me know ASAP I'd appreciate it.
Also, what do we need to do to get you the data you need when we see the problem happening? We can't leave the system in the corrupt state--this is a production system, and the corruption is directly affecting our users. So it's best if you tell me exactly how to collect the slab debug info you need when the problem is happening. It recurred today just after 4pm, BTW, but it's gone already.
Sorry, I missed the update to this bug, period! There is a slight performance degradation associated with slab debug. Basically it writes patterns to the kernel data structures as it frees them so it can detect corruption outside of allocated memory. I think you should expect to see about a 5% performance degradation running Oracle. If the kernel detects any corruption it simply crashes, and that's all we need. If you can get me the exact data that was corrupted, it might provide some hint as to where the corruption came from. Larry
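For readers unfamiliar with slab debugging, the detection idea Larry describes can be illustrated with a few lines of userspace C. This is only a sketch of the technique (fill freed objects with a known poison pattern, verify the pattern when the object is handed out again), not the kernel's actual slab code; the pattern byte, object size, and one-object "cache" below are made up for the illustration:

/* Sketch of slab-style free poisoning: anything that writes to an
 * object after it was freed destroys the poison pattern, and the
 * damage is reported at the next allocation of that object. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POISON_FREE 0x6b        /* pattern written into freed objects */
#define OBJ_SIZE    64

static unsigned char pool[OBJ_SIZE];    /* one-object "cache" for the sketch */

static void poison_free(void *obj)
{
        memset(obj, POISON_FREE, OBJ_SIZE);
}

static void *check_alloc(void)
{
        int i;

        for (i = 0; i < OBJ_SIZE; i++) {
                if (pool[i] != POISON_FREE) {
                        fprintf(stderr,
                                "slab debug: corruption at offset %d (0x%02x)\n",
                                i, pool[i]);
                        abort();        /* the real debug kernel crashes here */
                }
        }
        return pool;
}

int main(void)
{
        poison_free(pool);      /* object freed and poisoned */
        pool[40] = 0x42;        /* a buggy user scribbles on freed memory */
        check_alloc();          /* detected at the next allocation */
        return 0;
}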
Ack--crashing is bad. This is a production database. This will immediately affect all our customers, rather than just the ones whose data is corrupted. And are you saying you'd get data from the crash somehow? Or is it just that the fact that it crashed would tell you that it was indeed slab corruption that had caused the problem? If that's all it is, it'd be worlds better if we could use a kernel that had some less violent way of signalling that it had found the error :-). A file in /proc whose contents change from 0 to 1, or something, say.
This system just crashed (possibly due to the same bug, possibly due to a different bug). Atypically, this one did leave a minor trace of what happened, in /var/log/kernel:

Dec 5 14:39:07 servername kernel: Page has mapping still set. This is a serious situation. However if you

That was the full message. Given what the message says, I thought it might be related to this problem.
This is the whole message; it's from freeing a page that has the mapping still set:

Page has mapping still set. This is a serious situation. However if you are using the NVidia binary only module please report this bug to NVidia and not to the linux kernel mailinglist.

This was part of a BUG; the entire traceback would have told us exactly who freed the page in that state. Do you have serial consoles attached to these machines? If not, can you set one up? And yes, it might be related to the original crash, but we need the traceback to tell for sure. Larry
Larry, please also respond to note 14. We can't use the slabdebug kernel if it's going to crash the server. In this case the crash was followed by a reboot, so the serial console wouldn't have helped. And yes, we do have serial consoles, but these are IBM Bladecenters where the serial console is accessed via a VNC-based Java applet, so you get just one screen's worth of output...and thanks to Linux's console screen blanking code, you usually can't see a thing anyway. I've now added escape sequences to /etc/issue to circumvent the screen blanking code and have also set up netdump to log to our syslog server, so we'll see if that improves things. I take it the AltSysrq and /proc/slabinfo output I attached to the case didn't help? BTW, there is no "original crash" here; the original problem is memory corruption. The reason I posted note 15 was because it seemed possible that the "Page has mapping still set" message was actually a bogus error, caused by memory corruption in some page that triggered that failure path in the kernel.
The slab debug kernel will crash the server, but only if it detects fatal memory corruption. In other words, it will be crashing anyway (or, even worse, silently corrupting data), but with slab debugging turned on we will be crashing a bit earlier, before the consequential damage occurs. Larry
The three times we've seen this memory corruption, though, it *hasn't* crashed the server. Silently corrupting data is less awful than having the entire system crash in this case, because if it happens within Oracle's memory space Oracle will notice it and throw errors; the result is that a few customers may be affected, but not all of them. Crashing is far worse, because all users are screwed for several minutes at the least. So unless the slab debugging kernel can be made to signal its results in a less drastic way, we can't use it on this server.
FYI, we've fallen back to the SMP kernel on this system to see if that resolves the issue (which is painful since it requires us to trim the Oracle SGA significantly). So far we haven't seen any recurrences of the corruption problems, but it's only been a few days now.
So: today we hit the memory corruption bug on a different server, which is running the 2.4.21-25.ELsmp kernel. This is a web server development system, and it shares no common user programs with the database servers on which we've seen the problem previously. This is the same server from bug 141905, so it looks as though that may indeed be a duplicate of this bug (or at least I'm fine with treating it that way until this bug is resolved). This means that this bug is NOT specific to the hugemem kernel, nor is it related to Oracle. We caught the bug today via tripwire again--a "change" in /usr/X11R6/bin/Xvfb (from XFree86-Xvfb-4.3.0-68.EL). I have a copy of the corrupted file if you want it, but here's the full cmp -l output (produced on a different server):

# cmp -l /usr/X11R6/bin/Xvfb Xvfb.CORRUPT
1334321 10 2
1334322 213 0
1334323 135 0
1334324 370 0

If you want me to attach either the corrupt or good versions of Xvfb to the case, let me know. I also have AltSysrq-m, -w, -t and /proc/slabinfo output if that's useful--let me know. And the server is still running and still has the corruption of Xvfb in memory at the moment (no guarantees on how long that'll last though), so if you want me to get further debugging info from it just tell me what it is you want. And this is a development system, so violent tests (i.e. ones that crash) would be acceptable.
*** Bug 141905 has been marked as a duplicate of this bug. ***
[ Oops--just added this to bug 141905. Here you go. ] We just experienced a kernel panic on the database server of this pair which was NOT running the database--in other words, it was sitting idle except for VCS and periodic tripwire runs. Since it's possible that this was caused by the memory corruption bug, I'll give you the info here--but if I'm wrong about that, just say so and I'll file yet another bug. Here's the panic info (we don't have a memory dump for it):

----------------------------------------------------
Unable to handle kernel NULL pointer dereference at virtual address 0000002d
printing eip:
021491e4
*pde = 00003001
*pte = 00000000
Oops: 0000
nfs lockd sunrpc gab llt netconsole autofs4 audit tg3 e1000 sg sr_mod cdrom usb-storage keybdev mousedev hid input usb-ohci usbcore ext3 jbd mptscsih mptbase
CPU: 3
EIP: 0060:[<021491e4>] Tainted: PF
EFLAGS: 00010206
EIP is at do_generic_file_read [kernel] 0x174 (2.4.21-25.ELhugemem/i686)
eax: 0000001d ebx: 00000016 ecx: 1312b680 edx: 0000001d
esi: dfb4e1c4 edi: 12ed2c94 ebp: 000000de esp: cea33ef4
ds: 0068 es: 0068 ss: 0068
Process tripwire (pid: 25603, stackpage=cea33000)
Stack: dfb4e100 08208590 00000000 00001000 00000000 00001000 00000000 00000000
       00000000 dfb4e100 fffffff2 00001000 df368d80 ffffffea 00001000 02149e35
       df368d80 df368da0 cea33f5c 02149c80 00000000 02439680 00002710 cea32000
Call Trace: [<02149e35>] generic_file_new_read [kernel] 0xc5 (0xcea33f30)
[<02149c80>] file_read_actor [kernel] 0x0 (0xcea33f40)
[<02149f5f>] generic_file_read [kernel] 0x2f (0xcea33f7c)
[<02164ea3>] sys_read [kernel] 0xa3 (0xcea33f94)
Code: Bad EIP value.
CPU#0 is frozen.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is executing netdump.
CPU#4 is frozen.
CPU#5 is frozen.
CPU#6 is frozen.
CPU#7 is frozen.
I've added a fair amount of info to this case (an example of the data corruption, another panic trace), but haven't heard any word from Redhat on this lately. What's the status? Also, I see that one of the fixes in the U4 kernel release is "Data corruption on RHEL3 U4 beta -26.EL kernel," but I can't view the associated bug (bug 140022). Does this bug have anything to do with the one I've reported here? BTW, from our experience so far with the SMP kernel on various servers, it appears that the data corruption happens less frequently with the SMP kernel than with the hugemem kernel--though it does happen with both of them. Don't know if this helps.
John, please try to reproduce this problem with the official RHEL3-U4 kernel. It's now available on RHN. Larry Woodman
We'll try to do that (I admit I'm surprised Redhat would release U4 with a known data corruption bug in it). I'd appreciate answers to my questions, though.
Update: today the system from comment 21 crashed again, while running the -26.slabdebug kernel that you'd provided. And we're in the process of rolling out the U4/-27 kernel to all RHEL3 systems. To reiterate my questions:

1) You asked for an example of the actual corruption, and I gave you one (Xvfb, in comment 21). Did that info tell you anything?

2) I also gave you yet another kernel panic trace in comment 23. Did that tell you anything?

3) A bug identified as "Data corruption on RHEL3 U4 beta -26.EL kernel" (bug 140022) was listed as fixed in the U4 kernel release summary. What is that bug, and is it the same as the one I've reported here?
John:

1.) The Xvfb file diffs really didn't tell me anything. Evidently a few (3) bytes of that file are different than the original. Do you know if this data is NFS-mounted, and if it's different on the server and the client?

2.) The panic in do_generic_file_read is due to the page->next_hash containing a 0x1d instead of a valid page pointer. This is corruption in the page struct in the mem_map array.

3.) The data corruption in the -26.EL kernel (bug 140022) was due to a bug in the latest e100 driver. We did fix that bug in -27.EL, but you said that you are not using e100, right?

So far, this is the only report of corruption outside of the e100 driver. Any help in reproducing this problem would be extremely helpful. Larry
Thanks for the responses. The Xvfb file was on a local filesystem (ext3). The cmp -l output I sent you reflected the corrupted version on one system compared to a known-good version from another system (and after a reboot, the version of the file on the system where the corruption had occurred reverted to its original form). That was the full extent of the corruption--4 bytes on a 16-byte boundary. And nope, we don't use the e100 driver on these systems. In fact the system from comment 21 doesn't even use the e1000 driver (though our database servers do). We'd love to give you a way to reproduce this...we'd love to have one ourselves. But the behavior so far is just random. I'm guessing that you've not had other reports of this problem because it's very difficult to diagnose correctly; if it corrupts normal file data it may simply go unnoticed, and if it causes a crash, the crash may be written off as something else or may be misdiagnosed as having some other cause. We were only able to determine that it was random memory corruption thanks to the fact that tripwire MD5/CRC values were changing for files when no other file metadata changes were being flagged, and Oracle was reporting corrupted data blocks even though the Oracle DBVERIFY utility didn't see corruption when run against that same (NFS-mounted) file from another system. We're now running -27 everywhere except the production database servers, and we'll swap those over to -27 this weekend. We're also running tripwire in a tight loop on our quiescent database server, to see if it'll flush out the error.
Corruption again on one of the database servers, detected by tripwire reporting only MD5/CRC32 changes, this time in a static text file (/usr/share/ImageMagick/www/api/types/ImageAttribute.html). Here's the cmp -l output for a good version vs. the corrupted version:

# cmp -l /usr/share/ImageMagick/www/api/types/ImageAttribute.html ImageAttribute.html.CORRUPT
1177 144 150
1178 164 161
1179 150 5
1180 75 11

Again, just 4 bytes have changed, though this time on an 8-byte boundary. So it looks like some memory management routine may have an off-by-one error (one word, that is). This system is still running the -25 hugemem kernel (it can't be updated until this weekend, during a downtime window). We've definitely got strong evidence now that the problem happens far more frequently with the hugemem kernel than with the SMP kernel; we'd gone for two weeks with no recurrences on the SMP kernel, but after switching back to hugemem this weekend we've already seen two occurrences of corruption (including this one). And this is the same pattern we've seen in the past with the hugemem kernel.
Corruption again on one of the database servers, detected by tripwire reporting only MD5/CRC32 changes, this time in /lib/ssa/gcc-lib/i386-redhat-linux-gnu/3.5-tree-ssa/libgcj.a. Here's the cmp -l output for a good version vs. the corrupted version:

# cmp -l /lib/ssa/gcc-lib/i386-redhat-linux-gnu/3.5-tree-ssa/libgcj.a libgcj.a.CORRUPT
19040081 0 2
19040083 100 0
19040084 1 0

Note that it's only reporting three bytes of difference this time...but the odd byte out (19040082) is a null, so it's likely that the corrupted version and the original version just coincidentally both had nulls in that byte position. This was still on -25.ELhugemem.
We now have a bug on the -27 (hugemem) kernel as well, though I can't be certain if it's caused by the same corruption issue. I've been running tripwire in a loop on our quiescent database server, and a few hours ago it started hanging on the file /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py. At this point, any attempt to read this file causes the calling process to hang (unkillably). Here's an strace of an attempted access:

# strace -f cat /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py
execve("/bin/cat", ["cat", "/usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py"], [/* 29 vars */]) = 0
uname({sys="Linux", node="bom-db02", ...}) = 0
brk(0) = 0x9bce000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=36895, ...}) = 0
old_mmap(NULL, 36895, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf65f1000
close(3) = 0
open("/lib/tls/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200X\1"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1568924, ...}) = 0
old_mmap(NULL, 1276876, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xc6a000
old_mmap(0xd9c000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x131000) = 0xd9c000
old_mmap(0xda0000, 7116, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xda0000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf65f0000
set_thread_area({entry_number:-1 -> 6, base_addr:0xf65f0520, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
munmap(0xf65f1000, 36895) = 0
open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=32148976, ...}) = 0
mmap2(NULL, 2097152, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf63f0000
mmap2(NULL, 204800, PROT_READ, MAP_PRIVATE, 3, 0x9c4) = 0xf63be000
brk(0) = 0x9bce000
brk(0x9bef000) = 0x9bef000
brk(0) = 0x9bef000
mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE, 3, 0xa12) = 0xf63bd000
close(3) = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
open("/usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=3190, ...}) = 0
read(3,

That's the full output--the process hangs at that point. I'm reporting it as part of this bug because it seems possible that the memory corruption bug has affected some in-memory structure associated with this file, and that's causing problems for processes trying to access the file. I'll leave the system in this state for the moment; please let me know ASAP if there's any information you'd like to see that might help to identify what's going on here.
I know you all may be on vacation, but if not I really need to know as soon as possible what you'd like me to do about the system on which reading of the file /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py is causing processes to hang (as per comment 32). It's an inactive system for the moment, but that can change at any time, and I don't want to leave it in this state any longer than necessary. BTW, every process that's currently hung while accessing this file shows up in a ps listing with a state code of "D" (uninterruptible sleep), and each one is also permanently counted in the load average.
Hi John, yes most of us are on vacation this week. Please get me an AltSysrq-T output so I can see internally why the processes are in D state. Thanks, Larry
Created attachment 109166 [details] AltSysrq-T output

AltSysrq-T output from the system on which various processes (cat, less, tripwire) are hung while trying to read /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py.
Just to narrow things down a bit more, is the bug reproducible without syscall auditing enabled? IIRC the syscall auditing code has had problems (and might still have, I'm not sure), so it would be good to find out if the bug can also be reproduced without syscall auditing. Also, what is the "llt" driver? I don't think I have seen that one before...
Yep, I can easily hang any process just by trying to open and read that file (which is a local file, BTW). The only reason a few of them have syscall in the stack traces is because I either ran them with strace or strace'd an already-running process, to try to see what was happening. LLT (low-level transport) is a part of Veritas Cluster Server, which runs on this server and its companion server. If there's any debugging you want me to do, this system is basically wide open for testing (intrusive or otherwise).
Alas, spoke too soon. I started one more cat process (straight, no strace/ltrace), then went to get another AltSysrq-T output...and unfortunately, that hung the entire machine. Here was the last bit of output on the console:

bash R current 1728 10624 10623 (NOTLB)
Call Trace: [<f8debb60>] netconsole [netconsole] 0x0 (0xa9a37e84)
[<021299af>] __call_console_drivers [kernel] 0x5f (0xa9a37e94)
[<f8debb60>] netconsole [netconsole] 0x0 (0xa9a37e98)
[<02129ab3>] call_console_drivers [kernel] 0x63 (0xa9a37eb0)
[<02129de1>] printk [kernel] 0x151 (0xa9a37ee8)
[<02129de1>] printk [kernel] 0x151 (0xa9a37efc)
[<0210c8a9>] show_trace [kernel] 0xd9 (0xa9a37f08)
[<0210c8a9>] show_trace [kernel] 0xd9 (0xa9a37f10)
[<02126032>] show_state [kernel] 0x62 (0xa9a37f20)
[<021cffba>] __handle_sysrq_nolock [kernel] 0x7a (0xa9a37f34)
[<021cff1d>] handle_sysrq [kernel] 0x5d (0xa9a37f54)
[<02199a2c>] write_sysrq_trigger [kernel] (0xa9a37f78)
[<02165123>] sys_write [kernel] 0xa3 (0xa9a37f94)

The system was toast at this point, so the only thing I could do was powercycle it. You'll doubtless be unsurprised to hear that after the reboot, I can now cat /usr/share/doc/4Suite-0.11.1/demos/4ODS/tutorial/blob_test.py to my heart's content, without having it hang.
Verified: the U4 kernel also has this bug. Detected this time by tripwire reporting only MD5/CRC32 changes on our primary development server (2.4.21-27.ELhugemem, no Oracle, no VCS), this time in the file /lib/ssa/gcc-lib/i386-redhat-linux-gnu/3.5-tree-ssa/jc1. Here's the cmp -l output for a good version vs. the corrupted version:

# cmp -l /lib/ssa/gcc-lib/i386-redhat-linux-gnu/3.5-tree-ssa/jc1 jc1.CORRUPT
3587121 114 2
3587122 135 0
3587123 311 0
3587124 377 0

So the corruption does seem to be consistently just 4 bytes (and always at least word-aligned). The file is still in this state on this server--i.e. the in-memory copy is still corrupted--so if there's anything you want us to do to get more info, now's the time.
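Since the corruption now looks like a single aligned run of a few bytes each time, a small helper along the lines of the sketch below reports each differing run's offset, length, and alignment in one pass. It is only an illustration of the cmp -l comparisons already being done here; note that it prints 0-based offsets, whereas cmp -l prints 1-based ones:

/* Compare a known-good copy against the suspect copy and report each
 * run of differing bytes with its 0-based offset, length, and whether
 * the run starts on a 4- or 8-byte boundary.  Stops at the shorter
 * file, which is fine for same-size good/corrupt pairs. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        FILE *a, *b;
        long off = 0, run_start = -1, run_len = 0;
        int ca, cb;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <good-file> <corrupt-file>\n", argv[0]);
                return 1;
        }
        a = fopen(argv[1], "rb");
        b = fopen(argv[2], "rb");
        if (!a || !b) {
                perror("fopen");
                return 1;
        }

        while ((ca = fgetc(a)) != EOF && (cb = fgetc(b)) != EOF) {
                if (ca != cb) {
                        if (run_start < 0)
                                run_start = off;
                        run_len++;
                } else if (run_start >= 0) {
                        printf("diff run at offset %ld, length %ld, "
                               "4-byte aligned: %s, 8-byte aligned: %s\n",
                               run_start, run_len,
                               run_start % 4 == 0 ? "yes" : "no",
                               run_start % 8 == 0 ? "yes" : "no");
                        run_start = -1;
                        run_len = 0;
                }
                off++;
        }
        if (run_start >= 0)
                printf("diff run at offset %ld, length %ld\n", run_start, run_len);
        fclose(a);
        fclose(b);
        return 0;
}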
Is this using ServerWorks IDE for storage?
No. The servers that have exhibited the data corruption bug are all IBM blades--two HS40s (8839) and one HS20 (8832). The HS40s use SCSI drives via an IBM SCSI daughter card, and the HS20 uses IDE drives that use the HS20's onboard IDE controller. Also, the HS20 doesn't run Oracle or VCS. All three servers do mount various volumes via NFS, though--but the corruption has shown up both on NFS-mounted files and on local files, and the corruption in comment 40 was associated with a local file.
Another instance of the corruption bug on 2.4.21-27.ELhugemem, this time discovered by Oracle rather than tripwire (on one of the HS40s). We resolved this instance of corruption, but the corruption I mentioned in comment 40 is still there. I'm happy to do whatever you need in that case to get you the info you need to fix this problem.
We had two extremely harmful system interruptions today, presumably both caused by this issue. The second led to a system lockup and was accompanied by the following info in the netdump log:

Unable to handle kernel NULL pointer dereference at virtual address 00000032
printing eip:
02153aef
*pde = 00003001
*pte = 00000000

Unfortunately that's all we got from these two problems. What's the status on all this? We're at the point where we're seriously looking at migrating all our servers to another hardware/OS platform, despite the huge amount of effort involved, because with this bug RHEL3 is unsuitable for serious use. I've done everything you've asked and provided every kind of info I can--is it helping at all? Is there anything else you need? If so, what? I really cannot possibly overstate the criticality of this bug.
John, do you have the full console output/stack traceback for the last "Unable to handle kernel NULL pointer dereference at virtual address 00000032" you received? Larry Woodman
John, is this reproducible without the LLT part of Veritas Cluster Server running? Larry Woodman
As noted in comment 45, the output I included was all we got from those two problems. There was nothing else on the console, on the netdump server, or in the system logs (local or remote). Also, as noted in comment 40 and comment 42, one of the servers on which we've been seeing the corruption does not run VCS (of which LLT is a part)--so yes, it happens without LLT. The most notable commonality between the HS20 and the two HS40s is the use of NFS--though as I've mentioned, the problems occur both with NFS-mounted files and local files.
Why is this case still marked NEEDINFO? I've answered Larry's questions from comment 46 and comment 47. For our part, I'd appreciate a response to the following (from comment 44): What's the status on all this? We're at the point where we're seriously looking at migrating all our servers to another hardware/OS platform, despite the huge amount of effort involved, because with this bug RHEL3 is unsuitable for serious use. I've done everything you've asked and provided every kind of info I can--is it helping at all? Is there anything else you need? If so, what? I really cannot possibly overstate the criticality of this bug.
Hello, John. I put this BZ into NEEDINFO state after Larry posted his question in comment #47. For some reason, this didn't get reverted to ASSIGNED state when you provided the answer in comment #48, but the state is now ASSIGNED. Let me reassure you that this problem is being given our utmost attention. Two of our top engineers are assigned to this case full-time, and several others (including me) have been running experiments to try to narrow down the problem space. I'll let you know as soon as we understand the problem well enough to propose and test a possible fix. Thanks again for all your assistance.
Ok, great, thanks for the update. I suspected that it staying in NEEDINFO might have been a glitch of some sort.... I'll be waiting for word on what we can do. Believe me, we're highly motivated. :-)
We are, too! (We can now reproduce a data structure corruption in about half an hour, which we believe is related to the problem you're seeing.)
Hi, John. We believe we have found the source of data corruption, which results from any access to /proc/kcore when the number of vmalloc'd regions in the kernel is in a specific range. I'm in the process of verifying the fix. I would like to build you a test kernel based on the latest released RHEL3 kernel (2.4.21-27.0.2.EL, which is U5 + two security updates). Which RPM(s) do you need to verify that our fix resolves the data corruption problem that you reported? (Which config do you want?)
Created attachment 110040 [details] proposed fixes for memory corruption caused by /proc/kcore access

This is the /proc/kcore patch that is currently being built into externally releasable test kernels. The first patch hunk is the critical fix for the ELF header buffer sizing problem. These fixes have been verified to fix potential data corruption due to /proc/kcore under certain memory conditions in local testing.
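For anyone reading along without the attachment, the failure mode described (an ELF header buffer for /proc/kcore sized from a count of vmalloc'd regions, where that count can change before the buffer is filled) boils down to the pattern sketched below. This is a generic illustration of the bug class, not the actual fs/proc/kcore.c code or the patch itself; the 16-byte per-region entry size and the region counts are made up:

/* Sketch of a header-buffer sizing bug: the buffer is sized from the
 * region count at one moment, but the fill pass runs against a larger
 * count, so filling it would write past the allocation and corrupt
 * whatever happens to sit next to it in memory. */
#include <stdio.h>
#include <stdlib.h>

#define ENTRY 16        /* made-up size of one per-region header entry */

int main(void)
{
        int regions_at_sizing = 4;
        int regions_at_filling;
        size_t bufsize, needed;
        char *buf;

        /* step 1: size the header buffer from the current region count */
        bufsize = (size_t)regions_at_sizing * ENTRY;
        buf = malloc(bufsize);
        if (!buf)
                return 1;

        /* step 2: by the time the buffer is filled, more regions exist */
        regions_at_filling = 6;
        needed = (size_t)regions_at_filling * ENTRY;

        if (needed > bufsize)
                printf("need %lu bytes but allocated %lu: filling the header "
                       "would overrun the buffer and corrupt adjacent memory\n",
                       (unsigned long)needed, (unsigned long)bufsize);

        free(buf);
        return 0;
}

In general terms the cure is to make the sizing and the fill agree on one view of the region list, or to bound the fill by the allocated size; the attached patch is the authoritative description of what the RHEL3 fix actually does.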
Both smp and hugemem would be good. We've found that the corruption errors occur much more frequently with hugemem, so I suppose if we want to shake this out quickly that's what we'll go with. Has that been the case in your testing as well (hugemem is more fragile than smp)? It might also help if you could provide us with your testing methodology, so we can see if it seems to reproduce the problem in our environment as well.
John, the following test kernels (plus the kernel-source RPM) are available under my Red Hat people page here:

http://people.redhat.com/~petrides/.kcore/kernel-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-smp-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-hugemem-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-source-2.4.21-27.0.2.EL.ernie.kcore.1.i386.rpm

Please let us know whether this proposed fix resolves the data corruption problem you've encountered. If you need a different RPM, just list that here and one of us will make it available to you. Thanks in advance. -ernie
Ah, missed your comment and the BZ state change just now. In answer to your question, I was easily able to reproduce a corruption with the i686 smp config kernel on a system booted with "maxcpus=1 mem=384m" in the kernel boot args. My test case was using "tar" from /proc/kcore.
Ok, in a testament to our motivation, our production database server is already running this test kernel--and the hugemem version to boot, to give it the best chance to fail, if that's what it's going to do. Not that I *want* that to happen, mind you.... If it croaks, burps, or so much as hiccups, you will of course be the first to know.
Hi John, has the test kernel solved your problem? Have you had any data corruptions since installing it? Regards, Peter
We've not had any instances of corruption yet on the new kernel. However, that doesn't mean much; it will take weeks before we can say with any confidence that this kernel is actually addressing the problem (the SMP kernel previously went without failure for over a month, and the hugemem kernel went for nearly that long).
Hi, John. I'm going to commit our /proc/kcore memory corruption fix to U5 later this evening. Since we haven't heard of any recent crashes on your systems since you've been running the test kernel, we're hoping that this problem was the root cause of your instances of memory corruption. I'll be transitioning this BZ to MODIFIED once I do the commit. But if you do encounter another crash while continuing to run the test kernel, please put this BZ back into ASSIGNED state so that we'd know it still needs work.
A fix for the /proc/kcore memory corruption bug, which we believe is the root cause of this problem, has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-27.10.EL).
Created attachment 110473 [details] We're not out of the woods yet

The database server running the 2.4.21-27.0.2.EL.ernie.kcore.1hugemem kernel crashed unexpectedly today at 4pm. I'm attaching the complete "log" file generated on the netdump server; this is all that there was, other than a zero-length file called vmcore-incomplete (netdump isn't all it's cracked up to be on RHEL3, unfortunately).
Ugh, bad news. But that's an oops in rt_check_expire__thr(), which is not part of the RHEL3 kernel. Is that in the module that tainted your kernel?
VCS loads modules which taint the kernel, I believe, but I don't know whether or not that particular routine is part of the VCS code. If it's still the memory corruption bug, though, it could hit anywhere at all. Also, just as a reminder: we've had numerous instances of the corruption bug in the past on a system that does not run VCS (or Oracle), and which in fact runs a pretty plain vanilla application mix (NFS + web services).
Understood, John. Well, we have fixed a memory corruption bug caused by /proc/kcore access (which is the only one we could reproduce), but it seems that you are encountering a different problem. So, I'm changing this back out of MODIFIED state. Since we don't have access to the source code that oops'ed, you should try to get the appropriate 3rd party involved in the debugging. Please let us know whether you can get any clues. Also, please continue to run the kcore-fix kernel(s) to eliminate that as a possible culprit in future crashes. And please also try to collect a crash dump if possible so that we have something to go on. I'll bounce this back to DaveA and leave it in NEEDINFO state until we have something to work with.
The system that crashed runs VCS, Oracle, and nothing else beyond standard RHEL3 stuff. A google search on "rt_check_expire__thr" shows a lot of references to it in contexts that don't particularly look like they'd be associated with VCS (and in which I don't see any references to llt or gab, the two main kernel modules loaded by VCS). A system-wide grep finds it only in these files:

boot/System.map-2.4.21-15.0.4.ELhugemem
boot/System.map-2.4.21-25.ELhugemem
boot/vmlinux-2.4.21-25.ELhugemem
boot/System.map-2.4.21-25.ELsmp
boot/vmlinux-2.4.21-25.ELsmp
boot/vmlinux-2.4.21-15.0.4.ELhugemem
boot/System.map-2.4.21-20.ELhugemem
boot/vmlinux-2.4.21-20.ELhugemem
boot/System.map-2.4.21-27.ELhugemem
boot/vmlinux-2.4.21-27.ELhugemem
boot/System.map-2.4.21-27.ELsmp
boot/vmlinux-2.4.21-27.ELsmp
boot/System.map-2.4.21-27.0.2.EL.ernie.kcore.1hugemem
boot/vmlinux-2.4.21-27.0.2.EL.ernie.kcore.1hugemem

Are you sure this isn't a standard kernel routine? Doesn't the fact that that string is showing up in the vmlinux files (and nowhere else outside of /boot) imply otherwise? In any case, I don't know who else to involve. If there's info I can give you from these systems that might tell you more about it, though, feel free to ask. Or if you can tell me exactly what you want me to ask Veritas about this routine, I can run it past them and see if they know about it. As for trying to get a crash dump, we are (and have been) running netdump, but while we do occasionally get crash logs, it's been a complete bust when it comes to producing crash dumps. If there's some other way to get them that you can suggest, that'd be great.
John, you are correct, rt_check_expire__thr is a standard kernel routine. It's hand-crafted from this in include/linux/interrupt.h:

#define SMP_TIMER_NAME(name) name##__thr

and found in net/ipv4/route.c; the oops is happening on the line indicated below:

/* This runs via a timer and thus is always in BH context. */
static void SMP_TIMER_NAME(rt_check_expire)(unsigned long dummy)
{
        static int rover;
        int i = rover, t;
        struct rtable *rth, **rthp;
        unsigned long now = jiffies;

        for (t = ip_rt_gc_interval << rt_hash_log; t >= 0;
             t -= ip_rt_gc_timeout) {
                unsigned long tmo = ip_rt_gc_timeout;

                i = (i + 1) & rt_hash_mask;
                rthp = &rt_hash_table[i].chain;

                write_lock(&rt_hash_table[i].lock);
                while ((rth = *rthp) != NULL) {
======================> if (rth->u.dst.expires) {
                                /* Entry is expired even if it is in use */
                                if (time_before_eq(now, rth->u.dst.expires)) {
                                        tmo >>= 1;
                                        rthp = &rth->u.rt_next;
                                        continue;
                                }
                        }

Unfortunately without a dump there's not much to go on. Is the netdump-server on the same subnet? If you run anything other than the hugemem kernel, getting at least 1GB saved into the vmcore-incomplete may be enough to work with, since it will gather all of lowmem, which contains all of the kernel static memory and slabcache memory. A vmcore-incomplete of a hugemem kernel would, in most cases, need to be at least 4GB in size.
The netdump server for this particular server is on the same subnet. We are setting NETDUMPADDR and SYSLOGADDR to different values, however (and the syslog server is on a different subnet); I could try unsetting SYSLOGADDR if you think it would help, since based on bug 142921 it would seem that RHEL3 netdump doesn't like this sort of config. We can run the SMP kernel if you'd prefer it for smaller crash dump purposes, but the catch there is that the SMP kernel seemed to hit the corruption much less often than the hugemem kernel. So in that case it could easily be 3-4 weeks before we'd get another corruption event. Since the main reason for choosing the SMP kernel would be netdump, I may try intentionally panic'ing a test server to see if I can get netdump to produce a crash dump under any circumstances.
If you don't need SYSLOGADDR, take it out of the equation, although I can't really give you a good reason why it would interfere with the netdump process. As far as which kernel to run, that's your choice. Note that even if the SMP kernel was used, and it was able to create a 1GB dumpfile, there's still a possibility crucial data will be in module memory, in which case, it would still be pretty much useless.
Based on my testing, the netdump server will accept memory dumps if a system running the SMP kernel crashes, but not if one running hugemem crashes--the best I can get is a zero-length vmcore-incomplete file (as we saw in the actual crash from comment 65). Setting SYSLOGADDR appears to have no effect, as you'd expect. So we can either 1) run the SMP kernel, have a chance at a crash dump, but possibly have to wait many weeks for a corruption event, or 2) run the hugemem kernel, have no chance at a crash dump, but hopefully have to wait less time in between corruption events. It no longer matters to us because we stopped utilizing hugemem's extended process address space early on in this bug's lifetime, since that kernel had proven too unstable for production use...and at this point we're not going to start using those features again until we know we have a fix. So we can run either kernel; it's your call, based on which one you think will give you a better chance of tracking down this bug.
Then by all means, use the SMP kernel. Getting a dump is of prime importance here.
Ok, this morning we've had several completely unambiguous instances of the memory corruption bug on our production database server, detected by Oracle throwing block corruption errors. So I can say for certain that this bug is not resolved. The server in question is on the 2.4.21-27.0.2.EL.ernie.kcore.1hugemem kernel, but a memory dump isn't even an issue since in this case the corruption isn't causing a system crash (as it usually doesn't); it's just silently corrupting user data.
Created attachment 110582 [details] "log" output from another system crash

This is the "log" file generated by another system crash--this time on our primary development server (so no VCS or Oracle), running the 2.4.21-27.0.2.EL.ernie.kcore.1smp kernel. There was also a zero-length vmcore-incomplete file--sorry. I at least consider it progress that we're even *getting* vmcore-incomplete files from netdump now, since they weren't showing up in the past at all.
FYI, we had another instance of memory corruption on a database server (detected by Oracle) while running the 2.4.21-27.0.2.EL.ernie.kcore.1smp kernel. The server has since been rebooted. This is mainly news since it shows that the corruption bug still exists both in the hugemem and SMP kernels. I take it the "log" info from comment 76 didn't help much?
This is the same panic scenario as seen in duplicate #141905, where Larry indicated:

> BTW, this appears to be memory corruption in the mem_map (array of
> page structs).
>
> The crash in page_referenced was caused by a bad page->pte.chain
> value.
> ----------------------------------------------------------------
> int page_referenced(struct page * page, int * rsslimit)
> ...
>     for (pc = page->pte.chain; pc; pc = pte_chain_next(pc)) {
> ...
>         chain_ptep_t pte_paddr = pc->ptes[i];
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> The assembler code for this is:
> 0xc015ff38 <page_referenced+0x2f8>: mov 0x4(%esi,%ebx,4),%eax
>
> where esi: -> bcb64118
> ---------------------------------------------------------------
>
> This esi value can never be less than 0xc0000000!

The difference in this case is that esi is 0xa5bf27c. It almost looks as if it were a pte_addr_t, but the page flags values wouldn't make sense.
Corruption again on one of the database servers, detected by tripwire reporting only MD5/CRC32 changes, in the file /lib/modules/2.4.21-27.0.2.EL.ernie.kcore.1hugemem/kernel/drivers/block/cpqarray.o. Here's the cmp -l output for a good version vs. the corrupted version:

# cmp -l /lib/modules/2.4.21-27.0.2.EL.ernie.kcore.1hugemem/kernel/drivers/block/cpqarray.o cpqarray.o.CORRUPT
22713 103 0
22714 157 0
22715 155 0
22716 160 0

Again, just 4 bytes have changed, and on an 8-byte boundary. And this is with the full complement of RHEL3 U4 RPMs.
FYI: another corruption instance on one of our production database servers, detected and reported by Oracle. This is the first corruption instance since the last one I reported, on 2/15/2005 (comment 79)...thus illustrating what I'd mentioned about the SMP kernel often taking a very long time to manifest this bug. The server in question had been running without any apparent corruption instances for 25 days (on 2.4.21-27.0.2.EL.ernie.kcore.1smp).
FYI: another corruption instance on one of our production database servers, detected and reported by Oracle. Still on 2.4.21-27.0.2.EL.ernie.kcore.1smp.
Thanks for the updates, John. Incidentally, the official RHEL U5 external beta period is now scheduled to start on Monday, 3/21. The kernel version is 2.4.21-31.EL, and it contains the /proc/kcore fix as well as lots of other critical fixes. We do not expect that the problem you've encountered has been resolved, but it would be a good idea to try the U5 beta kernel at some point (preferably in non-production use first).
Thanks, we'll do that (and let you know if we see any changes in this bug's behavior as a result).
An unexplained crash on one of the systems where we see the corruption bug, so my assumption is that that was the cause. As usual, netdump did not capture a core dump (vmcore-incomplete was present, but 0-length). I'll include the "log" file from netdump separately.
Created attachment 112152 [details] "log" output from yet another crash
In this case, an nfs_file_write() operation needed a new page, so it called __alloc_pages(), which was in the act of reclaiming a page from the inactive_clean list. That particular page on the list had an inode mapping, but the list_head that should link it into the inode mapping's page list had a NULL page->list.prev pointer (at least). Hard to glean anything from that without a dump.

Questions arising from internal discussions re: this case:

- Can this bug be reproduced with system call auditing enabled?
- Are both servers writing to the same disk array?
- Is the SCSI bus configured with parity enabled? Are any parity errors logged?
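Given that the reclaimed page's list.prev was NULL, the kind of linkage check sketched below is what would catch such corruption before the dereference. This is a userspace illustration only, not RHEL3 kernel code (later kernels perform a similar check under their list-debugging option):

/* Sanity-check a doubly linked list entry before using it: both links
 * must be non-NULL and the neighbours must point back at the entry. */
#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static int list_entry_sane(const struct list_head *e)
{
        if (e->prev == NULL || e->next == NULL)
                return 0;                               /* corrupted linkage */
        if (e->prev->next != e || e->next->prev != e)
                return 0;                               /* neighbours disagree */
        return 1;
}

int main(void)
{
        struct list_head a, b, c;

        /* a <-> b <-> c, circular, like an inode mapping's page list */
        a.next = &b; b.next = &c; c.next = &a;
        a.prev = &c; b.prev = &a; c.prev = &b;

        printf("b sane: %d\n", list_entry_sane(&b));    /* 1 */

        b.prev = NULL;                                  /* the corruption seen here */
        printf("b sane: %d\n", list_entry_sane(&b));    /* 0: caught before unlink */
        return 0;
}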
Answers:

- If you want us to enable specific kinds of auditing, sure, we can do that (the systems use the default audit settings at present). As to whether or not the bug can be reproduced in that scenario, dunno, since we have no way to reproduce the bug at all other than waiting for it to happen. If you do want us to enable specific kinds of auditing, we'll need to know the expected performance impact as well.

- There are three servers involved: two database servers (a redundant pair, only one of which runs the database at any given time) and a development server. The database servers do not share any disk arrays--the database lives on NFS (on a Netapp box). The development server doesn't use Oracle at all. And again: the corruption has shown up in NFS-based files, local (ext3-based) files, and in-memory kernel structures that aren't associated with any disks.

- I don't see any way to check or change the parity settings for the SCSI bus on these servers (and no, we haven't seen any errors logged).
Actually, we'd prefer that auditing be turned completely off just to rule it out. And can you verify that you've turned off SYSLOGADDR in the client's /etc/sysconfig/netdump file? Also, just for sanity's sake, can you look at the /var/log/messages file on the netdump-server, and tell us what netdump-server error messages are there? There should be some kind of time-out error message. Does the client reboot, or just sit there after the "performing handshake" line?
Ok, we can turn off auditing. When you say this, are you talking about just killing off /sbin/auditd? We haven't commented out SYSLOGADDR on these clients, because based on our testing SYSLOGADDR appears to have no effect on whether or not a client can generate a successful crash dump (as per comment 73), and it's desirable to have it on since our syslog server and netdump server are two different servers. We can turn it off if you think it's worthwhile anyway, though. For this particular event I do see some messages in /var/log/messages regarding the netdump:

Mar 19 14:53:37 server netdump[1003]: No space for dump image
Mar 19 14:53:38 server netdump[1003]: Got too many timeouts waiting for SHOW_STATUS for client 0x0a141e28, rebooting it
Mar 19 14:53:42 server netdump[1003]: Got unexpected packet type 12 from ip 0x0a141e28

I think the "no space for dump image" may be a red herring because 1) the server in this instance had about 4GB free in /var/crash and 2) I've successfully forced a crash dump from a server with twice as much memory as the one in this incident (16GB vs. 8GB) to a netdump server with about the same amount of disk space available in /var/crash as the one in this incident. However, if you think 4GB free is insufficient, I'll try symlinking /var/crash to a larger partition on this particular netdump server. How much space does netdump actually require to dump a system with 16GB of RAM, or one with 8GB? And does this match the amount of space required for netdump to internally pass the "No space for dump image" test? As I recall from my testing, the dumps were always less than 1GB regardless of the memory size of the crashed system.
Oh, and the client *does* generally reboot. It did in this case, anyway, and I don't believe we've ever had a client hang while trying to generate a crash dump on a netdump server after crashing due to the memory corruption bug. Also I forgot to mention the other reason I think the "no space for dump image" message may be a red herring: because on the netdump client/server pair where we've successfully forced memory dumps to occur (in forced panic tests), we've also ended with zero-length vmcore-incomplete files when the system crashed on its own (as a result of the memory corruption bug). I didn't check /var/log/messages in those cases, though...but the point is that it seems we can get memory dumps when we force a crash, but not "in the wild" (or at least not with crashes associated with the corruption bug).
The vmcore files should be the size of memory plus a page for the prepended ELF header. The "No space" error message should not be considered a red herring (at least yet). That is especially true considering that the client reboots, because that means that the client received a reboot request, which would be the case if the no-space situation was perceived (rightly or wrongly) by the netdump server. If there was no communication coming back from the netdump server to the client, the client would hang indefinitely. In any case, the partition containing /var/crash should always be capable of holding a full memory-sized dump.

Re: audit, to temporarily turn the daemon off, just do a:

$ service audit stop

on a running system. To prevent it from being started on a subsequent reboot:

$ chkconfig audit off

And to restore it at some time in the future:

$ chkconfig audit on
Ok, I've symlink'ed /var/crash to a directory that should have sufficient space for a full memory image for the clients (though it's a very tight shave for the database servers). We'll see if that changes anything. I think I'd gotten the impression that the dump wouldn't include all of the memory image from your statement in comment 70 that "getting at least 1GB saved into the vmcore-incomplete may be enough to work with" (and "a vmcore-incomplete of a hugemem kernel would, in most cases, need to be at least 4GB in size" didn't help either... :-). I've forced a crash on one of our production database servers, and verified that it does in fact create a full vmcore file. My question about auditing was mainly to see if you had anything in mind other than the audit service itself. As of yesterday, I've turned off the audit service and unset SYSLOGADDR in the netdump config files on the servers where we've seen the corruption incidents. So now we just have to wait for the bug to strike again.
Hello, John. Another kernel bug has been discovered (in handling invalid or possibly unusual ELF library formats) that could possibly lead to arbitrary data corruption, which might occur in user data pages as well as in kernel data structures. This will be fixed in a U5 respin. I have also developed a patch to detect, report, and fix this problem as it occurs, but I'm still working on validating this test patch (which I hope to complete on Monday). Once the test patch is ready, I will make a test kernel available to you in order to verify whether the previously problematic code path is being triggered on your system. I'll update this BZ on Monday. Cheers. -ernie
Hello, John. I've completed validation of my test patch for detection, reporting, and recovery from a kernel bug in uselib() syscall handling. I've built test kernels based on the latest U5 beta (-31.EL) with the kernel version "2.4.21-31.EL.ernie.uselib.1" and have made several RPMs available on my "people" page. If this kernel detects a strange ELF library (that would have triggered the bug), the following set of messages will appear on the console:

load_elf_library: prevented corruption from weird ELF file
load_elf_library: dentry name 'badlib', inode num 33685
load_elf_library: kalloc() for 64 bytes returned 0xf796aa80
load_elf_library: kfree() was about to be passed 0xf796aaa0

where 'badlib' is the final pathname component of the strange ELF library with such-and-such inode number (which would allow you to locate the file). The test logic will then avoid the data corruption and allow the kernel to continue running normally. Please install one of the following RPMs on a system on which you're willing to run a U5 beta kernel from these URLs:

http://people.redhat.com/~petrides/.uselib/kernel-2.4.21-31.EL.ernie.uselib.1.i686.rpm
http://people.redhat.com/~petrides/.uselib/kernel-smp-2.4.21-31.EL.ernie.uselib.1.i686.rpm
http://people.redhat.com/~petrides/.uselib/kernel-hugemem-2.4.21-31.EL.ernie.uselib.1.i686.rpm
http://people.redhat.com/~petrides/.uselib/kernel-source-2.4.21-31.EL.ernie.uselib.1.i386.rpm

If your test system incurs another data corruption or crash without those console messages appearing, then we'd know that the uselib() handling bug is not related to the problem you're seeing. Please let me know how it goes. Thanks. -ernie
We've set up two of our three systems with your new kernel, and we'll move the third over to it during a downtime period. Does this patch also send output to syslog, or only to the console? And do you have a test executable I could use to trigger the error?
Thanks, John. I'll put this BZ back into NEEDINFO state until you let us know whether you see any of the "load_elf_library" diagnostics or whether you get another crash or corruption without the diagnostics. The diagnostics from the test patch simply use printk() without any log level designator. Thus, they'd appear both on the console and in /var/log/messages (via the syslog mechanism). I do have a reproducer, but I'd rather not make it available because of the obvious security ramifications. We'll be making another U5 beta respin available soon containing a fix, but I've created the test kernel with diagnostics specifically for you so that we could definitively resolve whether your systems were hitting this bug or not.
FYI, we're now running the 2.4.21-31.EL.ernie.uselib.1 kernel on all three of the systems on which we typically see the corruption issue, and we're running with the hugemem kernel on two of those in order to maximize the chance of hitting the bug. We'll let you know as soon as we see another corruption incident (handled or unhandled).
Thanks, John. Reverting BZ to NEEDINFO again.
Hi, John. Any news?
Nothing yet. You will of course be the second to know if we see another corruption incident (handled or unhandled).
Created attachment 113275 [details] "log" output from still another system crash

Another crash just a few hours after my last comment--this time on a system where we've not seen the memory corruption bug before (though it does share the same hardware configuration as the database servers where we often see the bug). And it was running the 2.4.21-31.EL.ernie.uselib.1hugemem kernel. It did produce a "log" file, but unfortunately no crash dump (another zero-length vmcore-incomplete file). So assuming that this was indeed the memory corruption bug again, it looks like the uselib patch didn't fix it for us.
Rats. Looks like we're back to square 1 ... no leads. Thanks for the update, John.
EIP is at clear_inode [kernel] 0xb5 (2.4.21-31.EL.ernie.uselib.1hugemem/i686)
eax: 0004086a   ebx: cd42e180   ecx: 00000000   edx: 1b0c0000
esi: cd42e180   edi: 1b0c1f88   ebp: 00000000   esp: 1b0c1f58
ds: 0068   es: 0068   ss: 0068
Process kswapd (pid: 19, stackpage=1b0c1000)
Stack: cd42e180 cd42e188 021811fc cd42e180 d2950068 1b300068 ccf22488 ccf22688
       ccf22680 021814dc 1b0c1f88 00001bf2 c6d06d88 dd5cb688 023a1b00 0006eb39
       00000001 00000040 02181724 00001bf2 00000001 021573f8 00000006 000001d0
Call Trace:   [<021811fc>] dispose_list [kernel] 0x3c (0x1b0c1f60)
  [<021814dc>] prune_icache [kernel] 0x8c (0x1b0c1f7c)
  [<02181724>] shrink_icache_memory [kernel] 0x24 (0x1b0c1fa0)
  [<021573f8>] do_try_to_free_pages_kswapd [kernel] 0x168 (0x1b0c1fac)
  [<021575a8>] kswapd [kernel] 0x68 (0x1b0c1fd0)
  [<02157540>] kswapd [kernel] 0x0 (0x1b0c1fe4)
  [<021095ad>] kernel_thread_helper [kernel] 0x5 (0x1b0c1ff0)

As best I can understand it, the failure is here:

void clear_inode(struct inode *inode)
{
	invalidate_inode_buffers(inode);

	if (inode->i_data.nrpages)
		BUG();
	if (!(inode->i_state & I_FREEING))
		BUG();
	if (inode->i_state & I_CLEAR)
		BUG();
	wait_on_inode(inode);
	DQUOT_DROP(inode);
===>	if (inode->i_sb && inode->i_sb->s_op && inode->i_sb->s_op->clear_inode)
		inode->i_sb->s_op->clear_inode(inode);
	if (inode->i_bdev)
		bd_forget(inode);
	else if (inode->i_cdev) {
		cdput(inode->i_cdev);
		inode->i_cdev = NULL;
	}
	inode->i_state = I_CLEAR;
}

...where the super_block's s_op field contains the 4086a value. It would be nice to be able to see the rest of the super_block.

Given a zero-length vmcore file again, what was the error message on the netdump-server this time?
It was "No space for dump image" again (just tracked it down--I didn't realize this was a daemon message rather than a kernel message). This server hadn't ever crashed before, so its dump server hadn't been set up to take the enormous memory dump it would produce. I've changed that now, and we're also actively running the same process that killed it on Saturday, in the hopes of producing another crash.
I have good news about Dave's analysis in comment #107: that crash was caused by the problem reported in bug 124600, which was fixed in the recent U5 respin (for kernel version 2.4.21-32.EL). We can ignore the new crash and expect it to be fixed in U5. Since bug 124600 is not known to be able to cause user-space or file buffer corruption, please continue to use the 2.4.21-31.EL.ernie.uselib.1 kernels (at least until the latest U5 respin is available in the RHN beta channels). Thanks. -ernie
I have bad news: we ran into the memory corruption bug on the database server from comment 105, this time detected by Oracle (so the system didn't actually crash). The kernel didn't notice or prevent the corruption instance. So apparently the patch in the 2.4.21-31.EL.ernie.uselib.1 kernels doesn't fix our bug. The good news is that this is on a system where we have a way to try forcing the bug to occur that's more likely to be successful. We're using the system for other testing right now, but once that's over we'll start hitting it around the clock to try to force it to crash again. Would it be of any use to you for us to force a system panic when we know there's memory corruption in the system? Or do you need a panic that was actually caused (somehow) by the corruption?
> Would it be of any use to you for us to force a system panic when we know > there's memory corruption in the system? Or do you need a panic that was > actually caused (somehow) by the corruption? We wouldn't know what to look at with a forced crash. With a panic caused by the corruption, we at least have some kernel evidence sitting in front of us.
We are particularly interested in the resolution of this bug for 8-way SMP machines with 4 GB (3G/1G) of memory used as SNFS [ADIC] IO servers, which serve 25-50 TB of storage. So far we have tried to expose the problem on the -15 kernel by reducing memory with mem=512M, but without success. The local IO stress tests on the IO servers run just fine, including tar-ing of /proc/kcore, etc.

My current plan is to use all hardware resources and populate the machine with the full 4 GB of memory, while restricting only the default -15 kernel's usable memory to less than 512M. To achieve this we would need to write and load a kernel module that poison-allocates more than 256M and monitors it for infringements (see the sketch below). Hopefully this will give us an early warning of memory corruption without triggering a kernel crash, i.e. without slab debugging turned on. This kind of kernel memory "reduction" would probably need to be refined later, along with more sophisticated IO stressing and monitoring programs, to exacerbate and analyze the problem.

We can offer the necessary hardware and software resources to test the solution of this bug using the above testing methodology. We'd appreciate any suggestions and cooperation... Rgds, Vangel
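[Editorial sketch] Purely as an illustration of the kind of poison-allocation watchdog described above, here is a minimal sketch for a 2.4-era kernel. The module name, allocation size, poison pattern, and scan interval are all invented; a real version would need care about where the memory comes from (a single vmalloc() of hundreds of MB will not fit in the default i386 vmalloc space) and should do the scan in process context rather than in a timer:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/vmalloc.h>
#include <linux/timer.h>

#define POISON_BYTES   (64UL << 20)    /* 64M for the sketch; the plan above calls for >256M */
#define POISON_VALUE   0x5a
#define SCAN_INTERVAL  (60 * HZ)       /* rescan once a minute */

static unsigned char *poison_area;
static struct timer_list scan_timer;

static void scan_poison(unsigned long unused)
{
	unsigned long i;

	/* NOTE: a full scan really belongs in a kernel thread, not a timer
	 * handler; it is done here only to keep the sketch short. */
	for (i = 0; i < POISON_BYTES; i++) {
		if (poison_area[i] != POISON_VALUE) {
			printk(KERN_ERR "poisonmon: corruption at offset %lu "
			       "(found 0x%02x, expected 0x%02x)\n",
			       i, poison_area[i], POISON_VALUE);
			poison_area[i] = POISON_VALUE;  /* re-poison and keep watching */
		}
	}
	mod_timer(&scan_timer, jiffies + SCAN_INTERVAL);
}

int init_module(void)
{
	poison_area = vmalloc(POISON_BYTES);
	if (!poison_area)
		return -ENOMEM;
	memset(poison_area, POISON_VALUE, POISON_BYTES);

	init_timer(&scan_timer);
	scan_timer.function = scan_poison;
	scan_timer.expires = jiffies + SCAN_INTERVAL;
	add_timer(&scan_timer);
	return 0;
}

void cleanup_module(void)
{
	del_timer_sync(&scan_timer);
	vfree(poison_area);
}

MODULE_LICENSE("GPL");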
John, we've just recently discovered another subtle data corruption bug that can affect x86 systems with > 4GB of memory (assuming they're running either the smp or hugemem configs). My next U6 build will include a fix for this problem. Although we are still near the beginning of the U6 development cycle (the next build will be #4), I will make kernels available to you either tomorrow or Monday for testing. It is likely that we will issue a post-U5 erratum with this fix plus a few security fixes, and so it would be very valuable to us if you could determine whether this new bug fix resolves the problems you've seen. I'll update this BZ again when the kernels are available. Thanks. -ernie
Hi, John. We have finally resolved the problem of a very obscure PTE race condition that could cause arbitrary memory corruption (in either user or kernel data) on SMP x86 systems with greater than 4G of memory (with either the smp or hugemem configs). We have already verified definitively that our fix corrects two open bugzilla reports (151865 and 156023), and we are very interested in whether this fix also resolves the data corruption problems you have encountered. The fix has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.4.EL). Although this is just an interim Engineering build for U6 development, I have made the following kernel RPMs available on my "people" page for you to test:

http://people.redhat.com/~petrides/.pte_race/kernel-hugemem-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-hugemem-unsupported-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-smp-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-smp-unsupported-2.4.21-32.4.EL.i686.rpm
http://people.redhat.com/~petrides/.pte_race/kernel-source-2.4.21-32.4.EL.i386.rpm

We intend to incorporate this bug fix in our next post-U5 security errata release (which has not yet been built, but which is likely to enter formal Q/A as soon as U5 is shipped on Wednesday, May 18th). In the meantime, please test this interim U6 build as soon as is practical so that we can determine whether the fix addresses this bugzilla (and BZ 141905). Thanks again for your patience and invaluable assistance throughout this investigation. Cheers. -ernie
The fix (referred to above) for a data corruption problem has also just been committed to the RHEL3 E6 patch pool (in kernel version 2.4.21-32.0.1.EL). I'm tentatively moving this BZ to MODIFIED state and will list it in the associated security advisory that we intend to release post-U5. Please still follow up with test results as requested in comment #125. Thanks. -ernie
Thanks, this does seem hopeful. We'll get this kernel installed on our candidate systems as soon as possible. Does moving the bug to MODIFIED state and listing it in the security advisory mean that you consider it resolved, though? We've been down this route before (e.g. comment 63)... there's no way to tell that this bug is actually fixed except to wait a long, long time. We've often gone 4-6 weeks on the SMP kernel without corruption incidents, and in fact the last corruption incident was 4 weeks ago, on April 20th. So it will take a solid month or more before we can say with any confidence that this fix addresses the type of corruption incidents we've been seeing. BTW, if you have a more detailed description of the fix, that would be nice (since this bug is very high-profile here, a lot of people will ask about it).
Hi, John. In response to your question about this bug being in MODIFIED state, yes, it means that I'm (optimistically) considering this to be resolved. I remember last time, but hopefully, I'll be right this time. :-) If you incur another memory corruption before we release the 2.4.21-32.0.1.EL kernel (which we're targeting for late next week), then I'll promptly change this bug back into ASSIGNED state and remove it (and bug 141905) from the advisory (which will be RHSA-2005:472). I understand that in only one week, we cannot be 100% confident that this bug is resolved. But I'd rather that this bug be appropriately associated with the advisory that contains the fix. (We won't be able to retroactively update the advisory after it's been released on RHN.) I'll attach the patch that fixes this problem along with a more thorough explanation shortly.
Created attachment 114542 [details] data corruption fix committed to RHEL3 U6/E6 kernel patch pools

The critical part of this fix is in establish_pte(), where the call to pte_clear() has been replaced with a call to ptep_get_and_clear(). On x86 smp and hugemem configs, the PTE being operated on is 64 bits wide. The higher-order half holds the part of the physical page frame number that is non-zero only on systems with more than 4GB of memory. The PTE can be accessed concurrently on different cpus only when it is being used by a multi-threaded application.

Consider the case of a multi-threaded app (e.g., Java) in which one thread is executing a fork() syscall and another is about to modify an already writable page whose mapping is not yet in that cpu's TLB. Before the bug fix, cpu A performing the fork might be marking the PTE as read-only via establish_pte() and get as far as clearing the higher-order half of the PTE in set_pte() via pte_clear(). (The top half is zeroed first in the include/asm-i386/pgtable-3level.h version.) Then on cpu B, the other thread stores into a data page whose physical address was supposed to be in memory above 4GB. The MMU on cpu B would load a translation from the correct lower half of the PTE but a zeroed higher half. Even though cpu A then completes the PTE update, cpu B has a wrong translation in its TLB up until the time that the flush_tlb_page() call on cpu A completes. During this window, cpu B can inappropriately modify a page in the first 4GB of memory at the physical address with the same lower-order 32 bits as the correct one.

The fix works by using ptep_get_and_clear(), which zeros the lower-order part of the PTE first (additionally using a read-modify-write cycle to memory). The same fix has been applied to unmap_hugepage_range(), although we do not believe that this latter code path caused any of the corruptions you experienced.

Thus, to incur the corruption, the following things were required:

1) x86 architecture
2) PAE mode (smp or hugemem configs)
3) more than one cpu in the system
4) multi-threaded application execution
5) (probably) a fork() syscall from the application
6) extremely unlucky timing between establish_pte() and another cpu's MMU

As you can see from this explanation, it isn't feasible to catch this sequence of events with a special diagnostic kernel, because one of the critical items (MMU operation) is invisible to the kernel. I suppose you'd need a multi-cpu logic analyzer. Hope this helps. Cheers. -ernie
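[Editorial sketch] To make the ordering argument concrete, here is a rough paraphrase of the PAE page-table helpers involved; the code follows the mainline include/asm-i386/pgtable-3level.h of that era and is not a verbatim copy of the RHEL3 source. The dangerous window exists because pte_clear() goes through set_pte(), which writes the high half of the 64-bit entry before the low half, while ptep_get_and_clear() knocks out the low half (and with it the present bit) first:

/* Approximation of the 2.4-era PAE helpers; not the exact RHEL3 code. */

static inline void set_pte(pte_t *ptep, pte_t pte)
{
	ptep->pte_high = pte.pte_high;	/* upper PFN bits (>4GB) written first */
	smp_wmb();
	ptep->pte_low = pte.pte_low;	/* present bit + lower PFN bits written last */
}

#define pte_clear(xp)	set_pte((xp), __pte(0))
/* Window: after pte_high is zeroed but before pte_low is, the entry still
 * looks present to another cpu's MMU, yet its frame number now points at
 * memory below 4GB -- exactly the corruption scenario described above. */

static inline pte_t ptep_get_and_clear(pte_t *ptep)
{
	pte_t res;

	res.pte_low = xchg(&ptep->pte_low, 0);	/* present bit disappears first */
	res.pte_high = ptep->pte_high;
	ptep->pte_high = 0;
	return res;
}

With ptep_get_and_clear(), another cpu's MMU either sees the complete old entry or a non-present one, never a half-updated translation.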
As a follow-up to my last comment, Larry Woodman and I have worked out a scenario where the establish_pte() call from handle_pte_fault() could allow the data corruption race condition to occur in the absence of any fork() system call from the multi-threaded application. Thus, item 5 in my list above is not a requirement.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-472.html
*** Bug 163699 has been marked as a duplicate of this bug. ***
*** Bug 158328 has been marked as a duplicate of this bug. ***