Description of problem:
Kernel panic occurs on RHEL 3 (base and updates 1, 2, 3) on a Dell PowerEdge 2650. Various processes are implicated in the oops, although it doesn't occur until the ServerVantage Linux agent is in the process of starting. Also getting hit with the aacraid panic at bootup of the Dell PowerEdge 2650 server.

Version-Release number of selected component (if applicable):
Version 3 at all levels of updates to the kernel.

How reproducible:
Call the ServerVantage Linux agent restart_ecoagt script, which starts the agent. The panic usually doesn't occur on the 1st start, but randomly. Running the smp kernel reduces the number of starts before it occurs.

Steps to Reproduce:
1. Execute restart_ecoagt.

Actual results:
See attachment.

Expected results:
No panic.

Additional info:
Occurs whether the agent is compiled on RH 9 or RHEL 3. We require NPTL in order to get multi-threading to work as described by the POSIX standard. No such problem running on AIX, HP-UX, or Solaris. Based on the ServerVantage Linux agent logging, the panic occurs at various stages of startup (no common piece of SV code). Customers of the ServerVantage Linux agent have also reported panics running RHEL 3 on other hardware such as IBM xSeries. This problem does not occur on RH 9.0 (VMware image) or SuSE Linux Enterprise Server 9 (Dell OptiPlex GX 260). Going to install SLES 9 on the same 2650 in order to eliminate hardware/memory possibilities. ServerVantage is not open source.
Created attachment 103959 [details] 4 Oops captured using ttyS0 -> HyperTerminal, copied to text file.
svdevrhl30b.prodti.compuware.com login:
Unable to handle kernel paging request at virtual address 00010003
printing eip:
c01660e3
*pde = 2ecc1001
*pte = 00000000
Oops: 0002
nfs lockd sunrpc audit lp parport autofs4 tg3 e100 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid input u
CPU:    3
EIP:    0060:[<c01660e3>]    Not tainted
EFLAGS: 00010206
EIP is at get_unused_buffer_head [kernel] 0x83 (2.4.21-20.ELsmp/i686)
eax: 0000ffff   ebx: f2a69000   ecx: 0000ffff   edx: 0000ffff
esi: 00000000   edi: edaab780   ebp: 00000000   esp: f3ff9e2c
ds: 0068   es: 0068   ss: 0068
Process kjournald (pid: 188, stackpage=f3ff9000)
Stack: c3ac2268 000000f0 f885975c 00000000 c3a85800 ed63e0b4 edaab780 0000000d
       c1bdf0c8 00000000 00000000 f0f07870 00000000 edaab780 0000000d f8856ad9
       f29f2e80 f0f07870 f3ff9e98 00000ac5 00000005 c3a85894 00000000 00000f44
Call Trace:   [<f885975c>] journal_write_metadata_buffer [jbd] 0xec (0xf3ff9e34)
[<f8856ad9>] journal_commit_transaction [jbd] 0xed9 (0xf3ff9e68)
[<f885951a>] kjournald [jbd] 0x17a (0xf3ff9fb0)
[<f8859380>] commit_timeout [jbd] 0x0 (0xf3ff9fd4)
[<f88593a0>] kjournald [jbd] 0x0 (0xf3ff9fe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xf3ff9ff0)
Installed SLES 9 on the same Dell 2650 and the panic has not occurred. This would seem to indicate a problem in the RH 2.4.21 kernel or NPTL 0.60. SLES 9 has the 2.6.5 kernel and NPTL 0.61. We would like to help resolve this in whatever way we can.
I forgot to mention that the only changes needed to run on SLES were makefile-related, e.g. the different locations of C++ headers. The code itself was unchanged. This problem is similar to the one seen when stopping Lotus Domino, except that turning off the audit daemon did not help us.
The other oopses here are:

kernel BUG at page_alloc.c:242!
invalid operand: 0000
nfs lockd sunrpc audit lp parport autofs4 tg3 e100 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid input u
CPU:    3
EIP:    0060:[<c0157d0f>]    Not tainted
EFLAGS: 00010286
EIP is at __free_pages_ok [kernel] 0xef (2.4.21-20.ELsmp/i686)
eax: f6e35300   ebx: c1dc84e8   ecx: 0003ace1   edx: 00000000
esi: f6e35300   edi: b74b4000   ebp: 00000000   esp: f6e0fdec
ds: 0068   es: 0068   ss: 0068
Process ecocComputer (pid: 2698, stackpage=f6e0f000)
Stack: c03a6480 00000002 c03a6364 00000286 fffffffe c03a76dc c1dc8524 c03a6480
       c1d2002c c03a7664 00000286 fffffffe 00000b38 0000008e c1dc84e8 b74b4000
       000000dc c013eaf2 c1dc84e8 0000008e 000000dc c013fbbd f7ba4380 f6e20dd0
Call Trace:   [<c013eaf2>] __free_pte [kernel] 0x52 (0xf6e0fe30)
[<c013fbbd>] zap_page_range [kernel] 0x1ed (0xf6e0fe40)
[<c0146d3a>] exit_mmap [kernel] 0xda (0xf6e0fe94)
[<c0126879>] mmput [kernel] 0x69 (0xf6e0feb8)
[<c012d596>] do_exit [kernel] 0x186 (0xf6e0fec8)
[<c012d92b>] do_group_exit [kernel] 0x8b (0xf6e0fee4)
[<c01372c0>] get_signal_to_deliver [kernel] 0x1f0 (0xf6e0fef8)
[<c010bef4>] do_signal [kernel] 0x64 (0xf6e0ff20)
[<c013c213>] do_futex [kernel] 0xe3 (0xf6e0ff58)
[<f8865e99>] ext3_file_write [ext3] 0x39 (0xf6e0ff74)
[<c013c2e9>] sys_futex [kernel] 0xb9 (0xf6e0ff88)
Code: 0f 0b f2 00 db bb 2b c0 8b 43 14 85 c0 0f 85 6c 02 00 00 b8
Kernel panic: Fatal exception

[root@svdevrhl30b root]#
Unable to handle kernel paging request at virtual address 00010003
printing eip:
c01660e3
*pde = 2cbf9001
*pte = 00000000
Oops: 0002
nfs lockd sunrpc lp parport autofs4 audit tg3 e100 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid input u
CPU:    1
EIP:    0060:[<c01660e3>]    Not tainted
EFLAGS: 00010206
EIP is at get_unused_buffer_head [kernel] 0x83 (2.4.21-20.ELsmp/i686)
eax: 0000ffff   ebx: 00000000   ecx: 0000ffff   edx: 0000ffff
esi: 00000000   edi: 00001000   ebp: 00000001   esp: eee3fe1c
ds: 0068   es: 0068   ss: 0068
Process ecoagt (pid: 4619, stackpage=eee3f000)
Stack: c3ac2268 000000f0 c01661b8 00000001 00000000 efafa800 00000806 c1dfcf04
       dcd25680 c1dfcf04 c01665d6 c1dfcf04 00001000 00000001 c1dfcf04 efadc300
       c0166c0d c1dfcf04 00000806 00001000 000000f0 00000000 ff0c5000 eee3fe88
Call Trace:   [<c01661b8>] create_buffers [kernel] 0x28 (0xeee3fe24)
[<c01665d6>] create_empty_buffers [kernel] 0x26 (0xeee3fe44)
[<c0166c0d>] __block_prepare_write [kernel] 0x2fd (0xeee3fe5c)
[<f885337b>] new_handle [jbd] 0x4b (0xeee3fe84)
[<c0167479>] block_prepare_write [kernel] 0x39 (0xeee3fea0)
[<f88684e0>] ext3_get_block [ext3] 0x0 (0xeee3feb4)
[<f8868bb9>] ext3_prepare_write [ext3] 0xc9 (0xeee3fec0)
[<f88684e0>] ext3_get_block [ext3] 0x0 (0xeee3fed0)
[<c014b6e3>] do_generic_file_write [kernel] 0x1e3 (0xeee3fef4)
[<c014bc3f>] generic_file_write [kernel] 0x13f (0xeee3ff48)
[<f8865e99>] ext3_file_write [ext3] 0x39 (0xeee3ff74)
[<c01635a7>] sys_write [kernel] 0x97 (0xeee3ff94)
Code: c7 40 04 ff ff ff ff c7 40 2c 00 00 00 00 f0 fe 0d 08 80 3a
Kernel panic: Fatal exception

Unable to handle kernel paging request at virtual address ffffffc8
printing eip:
c012d9a0
*pde = 00000000
Oops: 0000
nfs lockd sunrpc lp parport autofs4 audit tg3 e100 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid input u
CPU:    3
EIP:    0060:[<c012d9a0>]    Not tainted
EFLAGS: 00010246
EIP is at eligible_child [kernel] 0x20 (2.4.21-20.ELsmp/i686)
eax: ffffffff   ebx: ffffff40   ecx: ffffff40   edx: 00000000
esi: f6988000   edi: 00000000   ebp: f69880b8   esp: f6989f40
ds: 0068   es: 0068   ss: 0068
Process sh (pid: 1850, stackpage=f6989000)
Stack: c012de7e ffffffff 00000000 ffffff40 00010206 00000000 00000001 00000000
       f6988000 00000000 00000000 00000000 00000000 04000000 0013eeb8 00000000
       f6988000 f6988170 f6988170 00000000 08074020 04000000 0013eeb8 f6988000
Call Trace:   [<c012de7e>] sys_wait4 [kernel] 0xde (0xf6989f40)
[<c012e0b7>] sys_waitpid [kernel] 0x27 (0xf6989fac)
Code: 8b 81 88 00 00 00 83 f8 ff 74 5c 85 d2 79 51 83 f8 11 74 3c
Kernel panic: Fatal exception
You also refer to "the aacraid panic" but that's not described anywhere: do you have information about that panic too? The oopses above show little except for vague evidence of random memory corruption. There's almost no other common element in them. We'd really need something more concrete to pursue this --- a reproducer that we can run ourselves, for example, or a crash dump, or a much better description of what precise behaviour triggers the problem.
The aacraid-induced panic is bug 131703. I followed the steps to work around it as described in that bug. I wish I knew what (if anything) we can control (code-wise) to prevent these panics. It seems something in the ServerVantage startup process (minimum 3 processes) is exposing or causing them. It occurs sooner running an SMP kernel. It does not occur on RH 9.0 or SLES 9 (SLES 9 required recompiling the same code). I'm working on 2 tasks now:
1. make xconfig, load ../arch/i386/defconfig, change ONLY CONFIG_IKCONFIG to y, recompile, etc., and run that kernel (RHEL3 update 2).
2. Download the RHEL4 beta and try running SV there.
I can provide a crash dump provided I get explicit instructions on how to capture it.
RHEL3 supports network dumps, but not disk dumps. There's a whitepaper at http://www.redhat.com/support/wpapers/redhat/netdump/index.html
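For reference, the client side of a netdump setup mostly comes down to pointing at the collecting server; a minimal sketch (the server address below is a placeholder, and the exact variable name should be verified against the whitepaper):

```shell
# /etc/sysconfig/netdump -- client-side fragment (sketch; 192.168.0.10 is a
# placeholder for the netdump server's IP address, not a value from this bug)
NETDUMPADDR=192.168.0.10
```

Then enable the netdump service on the client (and the netdump-server service on the collecting machine) with chkconfig/service as the whitepaper describes; captured dumps typically land on the server under /var/crash.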
OK, after 4 tries I finally captured both log and vmcore on my netdump-server VMware image. The client VMware image is config'd with 256M, so how do I get it to you? Gzipped it's down to 72M+, and Bugzilla won't let me attach it. I doubt we have a publicly available URL you could get it from... email, ftp?!
Created attachment 105163 [details] netdump log file. Here's the log file anyway...
Moved 2 TAR+gzipped files to:
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdump1_files.TAR
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdump2_files.TAR
Each contains log and vmcore files. The 2nd panic occurred on the automatic reboot when logging in as root using KDE; the process in the Oops is kdeinit.
I cannot access those files:

$ lftp ftp://ftp.compuware.com
lftp ftp.compuware.com:~> cd pub/vantage/server/outgoing/
cd: Access failed: 550 /pub/vantage/server/outgoing: Access is denied.
If you use the full URL including netdump1_files.TAR in the browser Address it should prompt you to Open, Save, etc.
Added another set of vmcore & log files (enter the full URL to download):
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdump3_files.TAR
Tried raising the priority to High, but I guess the "Reporter" is not the same as the "submitter" so it wouldn't take. Now getting panics simply from running the tar command during ServerVantage (9.7) installation, before ANY of our code has a chance to run. Unable to recreate (using restart_ecoagt processing) on RHEL 4 Beta.
I don't have access to RHN Errata (up2date) to download individual kernel packages as we have only 5 machines configured for up2date. I can request one of the 5 IT machines be up2date'd or if "update 4" is available in ISO images for download, install them on my VMware image. I checked Easy ISOs on RHN and only see Update 3.
John, you should be able to login to https://rhn.redhat.com, then go to: https://rhn.redhat.com/network/software/channels/downloads.pxt?cid=1186 That link should take you to the Update 4 Beta ISO images. Update 4 has not been released yet in final form. However, I think that we would like you to try and test with a kernel that has slab-debug enabled. Watch this space, we'll post a link.
In this location: http://people.redhat.com/anderson/.BZ_132838 there is a U4 kernel with CONFIG_DEBUG_SLAB turned on. It's a UP kernel, which is more useful as far as slab debugging is concerned, because the UP kernel doesn't use per-cpu slab object caches, which aren't affected by the slab debug code. The directory contains four files:

  kernel-2.4.21-27.slab_debug1.EL.i686.rpm
  kernel-debuginfo-2.4.21-27.slab_debug1.EL.i686.rpm
  vmlinux-2.4.21-27.slab_debug1.EL
  vmlinux-2.4.21-27.slab_debug1.EL.debug

but only kernel-2.4.21-27.slab_debug1.EL.i686.rpm needs to be downloaded, installed, and rebooted into:

  $ rpm -ivh kernel-2.4.21-27.slab_debug1.EL.i686.rpm

Please ensure that netdump is still enabled, and then try to get us a netdump or two. The other files in the directory consist of the kernel debuginfo package and, for convenience's sake only, the vmlinux and vmlinux.debug files extracted from the two binary RPMs. These will only be of use for subsequent analysis of any dumpfiles. In any case, the slab debug code is not a panacea for all slab corruption problems, but it will hopefully help trap the problem at hand at an earlier stage. FWIW, I'll also attach my notes re: the slab corruption in the first two dumps to this BZ for future reference if necessary.
Created attachment 108334 [details] crash analysis notes for netdump #1
Created attachment 108336 [details] crash analysis notes for netdump #2
Installed the U4 beta for grins; the panic still happened. Applied the slab debug kernel and captured vmcore and log. Please download from:
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdump_debug.TAR.gz
Remember to enter the full URL in your browser to download.
The kernel panicked again when logging in using KDE after the auto-reboot, the second time this has happened. The ServerVantage init.d script and rc*.d symlinks had been removed, so we were not involved. The running process was artsd.
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdump2_debug.TAR.gz
Created attachment 108560 [details] crash analysis notes for vmlinux-2.4.21-21.EL kernel
Created attachment 108562 [details] crash analysis notes for slab-debug kernel
The 2.4.21-21.EL and the 2.4.21-27.slab_debug1.EL kernel panics, like the previous two, seemingly have no relationship in their end results -- other than the fact that all 4 dumpfiles show corruption in the size-2048 slab at a minimum, and in some cases in more slabs than that. The slab-debug kernel's protection mechanism obviously didn't catch anything in the act of a double free, which would seemingly have been the case, since all the dumps have the size-2048 slab chains (the partial and full) intermingled. It's not clear to me how else they could get into that state without being "caught" by the slab debug code.

What I'm wondering now is exactly *when* this corruption occurs. If a "cat /proc/slabinfo" were to be done with the size-2048 slab chains in the state seen in the dumpfiles, one of the following 3 BUG()'s would panic the system:

    list_for_each(q, &cachep->slabs_full) {
        slabp = list_entry(q, slab_t, list);
        if (slabp->inuse != cachep->num)
            BUG();
        active_objs += cachep->num;
        active_slabs++;
    }
    list_for_each(q, &cachep->slabs_partial) {
        slabp = list_entry(q, slab_t, list);
        if (slabp->inuse == cachep->num || !slabp->inuse)
            BUG();
        active_objs += slabp->inuse;
        active_slabs++;
    }
    list_for_each(q, &cachep->slabs_free) {
        slabp = list_entry(q, slab_t, list);
        if (slabp->inuse)
            BUG();
        num_slabs++;
    }

Can you simply boot the system, don't run anything, but just go into a virtual terminal window (or a serial console, preferably) and enter "cat /proc/slabinfo"? And then perhaps try a few things that have preceded the previous crashes, and do another cat of /proc/slabinfo. There must be some act that causes the corruption, and perhaps it can be tracked down in this manner. Other than that, perhaps another run with the slab-debug kernel would produce a netdump that would give us more clues.
I forgot to update that the 2nd crash dump, on the auto reboot, had picked up the default U4 kernel. I've updated grub/menu.lst to default to the slab_debug kernel. I'll add the cat of /proc/slabinfo to my start_loop script so it can be captured between calls to the restart_ecoagt script, which fires up the SV agent processes.
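For what it's worth, a minimal sketch of such a wrapper (the log path, restart-script location, and iteration count are illustrative assumptions based on this thread, not the shipped start_loop):

```shell
#!/bin/sh
# Sketch: snapshot the size-2048 slab line before each agent restart so the
# last state prior to a panic survives on disk. LOG, RESTART, and N are
# assumed defaults; the sleep between passes is omitted here for brevity.
LOG=${LOG:-/tmp/slabinfo.log}
RESTART=${RESTART:-/usr/ecotools/bin/restart_ecoagt}
N=${N:-3}

i=0
while [ "$i" -lt "$N" ]; do
    echo "=== pass $i $(date) ===" >> "$LOG"
    # /proc/slabinfo may be unreadable for non-root users; ignore failures
    grep size-2048 /proc/slabinfo >> "$LOG" 2>/dev/null || true
    # only call the restart script if it is actually installed
    [ -x "$RESTART" ] && "$RESTART" || true
    i=$((i + 1))
done
```

If a panic hits mid-run, the tail of the log shows the last slab state captured before the crash.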
Added another crash dump to the CW FTP site:
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdump3_debug.TAR.gz
I added slabinfo.log as well, which is the output of cat /proc/slabinfo grep'd for "size-2048"; it didn't look unusual just before the crash.
How about we provide you our Linux Agent binaries so that you can run it on your system(s) using whatever debug kernel you require? The Control Server component which runs on Windows is not required to recreate the panics we're exposing.
Created attachment 109403 [details] notes for last dumpfile sent
FWIW, I attached the notes for the last dumpfile, finding essentially the same problem -- two corrupt slab caches: the size-2048 cache, as has been the case in all prior dumps, plus the size-32 cache. But the dump trace is unrelated to the others, probably due to previous slab cache corruption. In any case, absolutely, if you can give us a reproducer, it would be to both of our benefits! Tell us what to do! Thanks, Dave
Added linux_ia32.TAR.gz to the Compuware FTP site; point your browser at the following URL to download it:
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/linux_ia32.TAR.gz
Once you have it downloaded to /tmp on a test system running RHEL 3 U4, uncompress it. Then run the extracted install file to install linux_ia32.TAR, e.g. ./install, which will walk you through an install. Any questions? Email, or call me at 313-227-6779.
I forgot... to recreate the problem there's another script in linux_ia32.TAR.gz that is normally not shipped, start_loop. It invokes /usr/ecotools/bin/restart_ecoagt every 5 seconds. To speed things up between restarts you can reduce the sleep in start_loop and the sleeps in stop_agt...
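The cadence described above can be approximated like this (a sketch only; the shipped start_loop is not reproduced here, and the restart-script path and pass count are assumptions):

```shell
#!/bin/sh
# Approximation of start_loop: invoke restart_ecoagt at a fixed cadence.
# SLEEP defaults to the 5 seconds mentioned above; lowering it (and the
# sleeps in stop_agt) speeds up the reproduction attempt.
SLEEP=${SLEEP:-5}
RESTART=${RESTART:-/usr/ecotools/bin/restart_ecoagt}
COUNT=${COUNT:-2}

n=0
while [ "$n" -lt "$COUNT" ]; do
    # skip silently if the agent isn't installed on this box
    [ -x "$RESTART" ] && "$RESTART" || true
    n=$((n + 1))
    [ "$n" -lt "$COUNT" ] && sleep "$SLEEP" || true
done
```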
Got this far -- but don't know what to do re: "control server" info:

Select one of the following options:
  1) Unload agent station software
  2) Install agent station software
  3) Configure monitoring of databases/applications
  4) Transfer agent station software to remote agents
  Q) Quit installation
Enter your selection [Q] => 2
Checking for NPTL.
Threads: NPTL 0.45
Found Native POSIX Threads Library!
Enter the full pathname of the directory to store temporary files [/usr/tmp/ServerVantage/tmp] =>
Enter the full pathname of the directory to store log files [/usr/tmp/ServerVantage/tmp] =>
Enter the full pathname of the directory to store data files [/usr/tmp/ServerVantage/datafiles] =>
During the install process, please specify the Control Server by hostname or by IP address. Using the IP option is best for complex networks, including firewalls, multiple local NICs, or lack of DNS resolution.
Does a firewall exist between this agent station and the control server (y/n) [n] =>
The current control server configuration is:
  CS Hostname            : No current value
  CS IP address          : No current value
  Event port             : No current value
  TCP port               : No current value
  Ecoagt RPC min port    : No current value
  Ecoagt RPC num ports   : No current value
  Ecoagt Localhost IP    : No current value
  Ecoagt Localhost Alias : No current value
Enter the host name of the control server =>
You can use any hostname or IP address that is listed in /etc/hosts but do not test the CS connection which follows, just enter n for that test.
Have you been able to get the SV Linux agent installed successfully? Once the Control Server is entered the other config parms can be defaulted.
Sorry John -- I haven't been able to get the change to retry it. I'm working on another reproducible slab cache corruption issue that may be related to this one; if it turns out not to be, I'll get back on this one as soon as I can.
As it turns out, after revisiting the original dumps from this case, the error signature is the same as with the other case I'm currently working on. The size-2048 cache is not being corrupted in my other case, but the data structure corruption seen shows exactly the same thing as with this case. I didn't realize it at the time, but in this case and in my other reproducer, data structures from the slab cache are being over-written by a piece of an active task_struct. We're trying various strategies to "catch it in the act", but since it's really not a case of a double-free, or other slab cache mishandling, the slab-debug code is not catching it. In any case, I may come back to using ServerVantage as the reproducer, but I just wanted to let you know that we're attacking this issue with the highest priority.
Good news I guess... will wait to hear from you.
Created attachment 109604 [details] Installation log file from installing ServerVantage
This was one of the errors in the install.log:
/opt/ServerVantage/bin/ecoconfig: error while loading shared libraries: libeco_core.so: cannot open shared object file: No such file or directory
The installation continued, but ecoagt would never start. Please review the installation log file. Let me know if you need any additional information.
Created attachment 109607 [details] Result from start_loop script
After manually setting the LD_LIBRARY_PATH variable to the LD_LIBRARY_PATH in /etc/init.d/ServerVantage, we started the loop test. The log file has the output of the test script. Do we need to have Java installed for this application to work properly?
Hmmmm... restart_ecoagt should be setting LD_LIBRARY_PATH, ECOBOOTSTRAP, and ECOHOME. There should be a logerror.ecoagt.<pid> in /opt/ServerVantage/tmp. I think the problem is that the 9.1 agents look for a FlexLM license on the agents. Let me get you a 9.7 agent, which doesn't require licensing checks on the agents; that's done on the Windows CS... I'll post another URL for the 9.7 agent once I get it on the FTP site.
OK, the 9.7 agent is here:
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/linux_ia32.TAR.Z
Sorry about that...
John, I did an uninstall of the previous version. That removed all the directories and files. When I opened the new tar file, the install script was missing; also, your start_loop script is not in the tar file. Did you want me to install this version right on top of the other version without doing an uninstall? Jeff
You can install over the old directories; just do 1) Unload first. Can you still extract the start_loop script from the original TAR file?
The test ran just fine this time. Unfortunately it did so about 500 times in a row without failing, so we're going back to our other test scenario that leads to the same result. Thanks, anyway... We'll keep this case posted when we come up with something.
It happens sooner on an SMP kernel. On a single-CPU box I've seen it go for 2-3K restarts before it panics. Be patient; if you run it, it will panic. It also seems to happen pretty quickly on a Dell PowerEdge 2650 running the SMP kernel.
We were running it on an SMP kernel on a UP box. We just thought we'd get it to happen quicker with this test than with the one we'd been using, which typically takes an hour or so. It really doesn't make much difference what test we use, and we can make it happen fairly quickly with our test.
Dave: Our VPs are asking for a status on the testing of this bug. Can you please provide an update?
Sure. We know what the "error signature" is, but have not yet come up with a way to catch it: The last 496 bytes of a task_struct (in our test case, that of the currently-running task), are being errantly copied to the beginning of a slab cache page. The problem is figuring out when and how it happens; by the time we bump into it and the system crashes, it's well past the time of corruption. We're adding debug code to various data-move routines, checking for source addresses (i.e., in the current task), and for destination addresses that are in the typically-targeted slab caches (in our tests, they predictably end up corrupting a page in the inode cache, the dentry cache or the size-128 cache). All I can say is that it's getting full-time attention, with no resolution as of yet.
John, In the original post this issue was being seen on a Dell PowerEdge 2650. Could you please provide a little more information on that hardware configuration? Specifically I am looking for the memory configuration. But the more data the better. If the system has not been modified from the factory ship, I can get all the information I need if you pass along the Dell Service Tag. Thanks in Advance
Jeff, The "tag" is JLDRY11 but this also happens in a VMware image on a Dell OptiPlex GX260, also IBM Zseries at customer sites among others.
John and Stephen, The following test kernels (plus the kernel-source RPM) are available under this Red Hat people page:
http://people.redhat.com/~petrides/.kcore/kernel-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-smp-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-hugemem-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-source-2.4.21-27.0.2.EL.ernie.kcore.1.i386.rpm
Please let us know whether this proposed fix resolves the data corruption problem you've encountered. If you need a different RPM, just list it here and one of us will make it available to you. Thanks in advance.
Thanks Jeff... I will install it and give it a try and probably get back to you Monday unless it panics today... hopefully it doesn't panic and runs the w/e!
John, Just following up. All my tests ran successfully over the weekend. Any word on how your testing is going? Thanks, Jeff
Jeff, So far the kernel patch looks good. I've hit 9000+ restarts (of ecoagt) without seeing any panics. Since we've seen it rather quickly and predictably on a dual-CPU machine I'd like to test it there as well. I'm currently waiting for access to a dual-CPU machine. Regards, John
John, Just to clarify what the problem is, and why we're confident that the fix addresses your problem. In your libeco_compsys.so.9.1 library, there is this:

  $ strings libeco_compsys.so.9.1 | grep kcore
  file /proc/kcore | awk '{print $3}' | awk -F- '{print $1}' 2>/dev/null
  $

/proc/kcore is read to determine whether the kernel is 32-bit. Your install script does the same thing to determine "OS_BITS", although it would only do it one time...

When the pseudo-file /proc/kcore is accessed, the kernel dynamically creates a "fake" ELF header, which is then copied out to the user-space program (the "file" command in your case). It typically needs only 1 page to create the ELF header, and so a single page is allocated in the kernel. However, there are circumstances, depending upon the number of vmalloc() calls that have been made during your kernel's run-time, where it needs more than the 1 page that was allocated. The over-run consists of the tail end of a copy of the currently-running task's task_struct, which flows into the beginning of the next page after the allocated one. Typically (but not necessarily) this tends to be a slab cache page, and the corruption is usually not encountered until a much later point in time.

We were re-creating the problem by simply doing a tar of /proc, which also pre-determines what type of file /proc/kcore is before reading it. That was far more deterministic in re-creating the problem, since the tar process was consuming most of memory with dentry and inode slab cache data. In any case, the fix was to correctly calculate the size of the ELF header.
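Given that explanation, the trigger can be exercised deliberately; a hedged sketch (not the reproducer used above, which tarred /proc) that simply re-reads the start of /proc/kcore, where the kernel's synthetic ELF header lives:

```shell
#!/bin/sh
# Stress sketch: each read of the start of /proc/kcore makes the kernel
# rebuild its synthetic ELF header, which on an unfixed RHEL3 kernel could
# overflow its single-page buffer once enough vmalloc'd regions exist.
# 52 bytes is the size of a 32-bit ELF file header (Elf32_Ehdr).
# Harmless on a fixed kernel; /proc/kcore may be absent or unreadable in
# some environments, so the read is guarded.
i=0
while [ "$i" -lt 50 ]; do
    if [ -r /proc/kcore ]; then
        head -c 52 /proc/kcore > /dev/null 2>&1 || true
    fi
    i=$((i + 1))
done
echo "completed $i reads"
```

Whether any single read actually overruns depends on the kernel's vmalloc history at that moment, which is why the corruption shows up so erratically.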
Dave, Jeff, It looks like the panic situation starting SV has been resolved, for I still have not encountered a panic on a single-CPU image. Before we inform our customers that it is safe to run us in a production environment, we're still going to test on a dual-CPU machine. Thank you and the others involved for your help. Regards, John
Good morning all: Before we close out this bug, I request to allow John to complete all testing in multi-CPU environments. I want to make sure our customers are covered and can be assured that the product is safe to run in their production environments. Otherwise, we will be back with this issue again.
It looks to be running well on the dual-CPU box too: 3600+ restarts without a panic. The "official" kernel patch being tested by RH QA -- will it be the same as what I've tested with? Regards, John
Hi, John. The official patch that I'm going to commit to U5 tonight has had one minor improvement for maintainability (which was suggested during code review). However, the functionality will match exactly what you're already testing.
A fix for this problem has just been committed to the RHEL3 U5 patch pool this evening (in kernel version 2.4.21-27.10.EL).
Hi Ernie, Will there be a separate kernel patch or will it be available only in U5? -John
Created attachment 110472 [details] /proc/kcore fix committed to RHEL3 U5 Hi, John. I've attached the exact patch that was committed to U5 last week. You've been testing with a patch that is only slightly different in the 1st patch hunk, but the functionality is the same. We don't currently have plans to release a pre-U5 erratum with this fix, although our support organization might consider making a U5 "Hot Fix" kernel (based on my interim U5 build Friday) available to select customers after it's had a little Q/A. (Hot Fix kernels are snapshots of the next Update-in-progress, and thus include everything we've committed to U5 so far -- about 80 fixes at this point.)
Hi Ernie, Do you have an ETA for Update 5? I'm considering the removal of the file command against /proc/kcore from our code and install script since we can just default to 32bit. However we have to run tar to extract ourselves so we'd still expose our customers to this problem until U5 is released. Regards, John
John, U5 beta is currently scheduled to start mid-March, and final release is currently scheduled for beginning of May.
Ernie: This is Stephen Karniotis at Compuware. We have identified another 7+ mutual customers that have incurred this problem with expectations of more as they deploy our Linux Agent. We need to get this patch in their hands before May. We would prefer a Hot Fix for this if possible. We are also open to getting an agreement created from Red Hat to allow us to distribute this to our Premier Customers and allow them to download from either our site or yours. Please discuss with Bret Hunter in the Alliance Organization as well as your management and have someone call me to discuss. My direct number is (313) 227-4350; wireless is (248) 408-2918. Need a resolution very soon.
Hello, Stephen. I'm just a lowly engineer (and RHEL3 kernel pool maintainer). All I can tell you is that a RHEL3 pre-U5 kernel with this fix has already been built, and it is a viable candidate for a "Hot Fix" kernel that could be provided by our Customer Support organization. Since Bugzilla is simply a bug tracking tool, I'd recommend that you engage Customer Support directly (indicating that the fix you want is in kernel version 2.4.21-27.10.EL or later).
I am attached to this issue from the partner team. I am waiting for Stephen Karniotis to get back to me. -----Mike
This issue was marked MODIFIED by Ernie Petrides as part of his errata tracking procedure when he puts the fix into the RHEL3 source tree. Why was it put back to ASSIGNED state?
Reverting to MODIFIED state until MikeW provides evidence that this problem has not been fixed in RHEL3 U5 beta.
I have no idea what that means. Right now, Compuware is eagerly awaiting the update release that contains the fix. Their customer base is all operating on hacked up workarounds right now until we can get them the drop with the fix.
The "Bug Activity" shows that you (Mike) changed the status from MODIFIED to ASSIGNED at the same time you made your first post. I don't know whether you did that manually, or if it got switched automatically somehow?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html