Bug 132838 - Kernel Panic: Unable to satisfy kernel paging request... when starting ServerVantage.
Summary: Kernel Panic: Unable to satisfy kernel paging request... when starting Server...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
high
high
Target Milestone: ---
Assignee: Dave Anderson
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2004-09-17 17:45 UTC by John Schmidt
Modified: 2009-03-31 09:52 UTC (History)
17 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-05-18 13:28:08 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
4 Oops captured using ttyS0 -> HyperTerminal, copied to text file. (11.55 KB, text/plain)
2004-09-17 17:58 UTC, John Schmidt
no flags Details
netdump log file. (5.47 KB, text/plain)
2004-10-13 21:06 UTC, John Schmidt
no flags Details
crash analysis notes for netdump #1 (5.11 KB, text/plain)
2004-12-10 18:05 UTC, Dave Anderson
no flags Details
crash analysis notes for netdump #2 (4.71 KB, text/plain)
2004-12-10 18:07 UTC, Dave Anderson
no flags Details
crash analysis notes for vmlinux-2.4.21-21.EL kernel (9.54 KB, text/plain)
2004-12-14 20:10 UTC, Dave Anderson
no flags Details
crash analysis notes for slab-debug kernel (35.33 KB, text/plain)
2004-12-14 20:13 UTC, Dave Anderson
no flags Details
notes for last dumpfile sent (8.88 KB, text/plain)
2005-01-05 21:59 UTC, Dave Anderson
no flags Details
Installation log file from installing ServerVantage (3.63 KB, application/octet-stream)
2005-01-11 13:46 UTC, Jeff Burke
no flags Details
Result from start_loop script (544 bytes, application/octet-stream)
2005-01-11 14:14 UTC, Jeff Burke
no flags Details
/proc/kcore fix committed to RHEL3 U5 (4.56 KB, patch)
2005-02-01 00:51 UTC, Ernie Petrides
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:294 0 normal SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 5 2005-05-18 04:00:00 UTC

Description John Schmidt 2004-09-17 17:45:37 UTC
Description of problem: Kernel panic occurs on RHEL 3 base and 
updates 1,2,3 on Dell PowerEdge 2650.  Various processes implicated 
in the oops although it doesn't occur until ServerVantage Linux agent 
in process of starting.  Also getting hit with the aacraid panic at 
bootup of Dell PowerEdge 2650 server. 


Version-Release number of selected component (if applicable): Version 
3 at all levels of updates to the kernel.


How reproducible: Call ServerVantage Linux agent restart_ecoagt 
script which starts the agent.  It usually doesn't occur on the 1st 
start, but randomly.  Running the smp kernel reduces the number of 
starts before it occurs.


Steps to Reproduce:
1. Execute restart_ecoagt.
2.
3.
  
Actual results: See attachment.


Expected results: No panic.


Additional info: Occurs whether the agent is compiled on RH 9 or RHEL 
3.  We require NPTL in order to get multi-threading to work as 
described by the POSIX standard.  No such problem running on AIX, HP-
UX, or Solaris.  Based on the ServerVantage Linux agent logging, 
occurs at various stages of startup (no common piece of SV code).  
Customers of ServerVantage Linux agent have also reported panics 
running RHEL 3 on other hardware such as IBM xSeries.  This problem 
does not occur on RH 9.0 (VMware image) or SuSE Linux Enterprise 
Server 9 (Dell OptiPlex GX 260).  Going to install SLES 9 on the same 
2650 in order to eliminate hardware|memory possibilities.  
ServerVantage is not open source.

Comment 1 John Schmidt 2004-09-17 17:58:00 UTC
Created attachment 103959 [details]
4 Oops captured using ttyS0 -> HyperTerminal, copied to text file.

Comment 2 Arjan van de Ven 2004-09-17 18:00:20 UTC
svdevrhl30b.prodti.compuware.com login: Unable to handle kernel paging
request a
t virtual address 00010003
 printing eip:
c01660e3
*pde = 2ecc1001
*pte = 00000000
Oops: 0002
nfs lockd sunrpc audit lp parport autofs4 tg3 e100 ipt_REJECT
ipt_state ip_connt
rack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid
input u
CPU:    3
EIP:    0060:[<c01660e3>]    Not tainted
EFLAGS: 00010206
 
EIP is at get_unused_buffer_head [kernel] 0x83 (2.4.21-20.ELsmp/i686)
eax: 0000ffff   ebx: f2a69000   ecx: 0000ffff   edx: 0000ffff
esi: 00000000   edi: edaab780   ebp: 00000000   esp: f3ff9e2c
ds: 0068   es: 0068   ss: 0068
Process kjournald (pid: 188, stackpage=f3ff9000)
Stack: c3ac2268 000000f0 f885975c 00000000 c3a85800 ed63e0b4 edaab780
0000000d
       c1bdf0c8 00000000 00000000 f0f07870 00000000 edaab780 0000000d
f8856ad9
       f29f2e80 f0f07870 f3ff9e98 00000ac5 00000005 c3a85894 00000000
00000f44
Call Trace:   [<f885975c>] journal_write_metadata_buffer [jbd] 0xec
(0xf3ff9e34)
[<f8856ad9>] journal_commit_transaction [jbd] 0xed9 (0xf3ff9e68)
[<f885951a>] kjournald [jbd] 0x17a (0xf3ff9fb0)
[<f8859380>] commit_timeout [jbd] 0x0 (0xf3ff9fd4)
[<f88593a0>] kjournald [jbd] 0x0 (0xf3ff9fe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xf3ff9ff0)


Comment 3 John Schmidt 2004-09-23 12:12:32 UTC
Installed SLES 9 on the same Dell 2650 and the panic has not 
occurred.  This would seem to indicate a problem in the RH 2.4.21 
kernel or NPTL 0.60.  SLES 9 has the 2.6.5 kernel and NPTL 0.61.  We 
would like to help resolve this in whatever way we can.

Comment 4 John Schmidt 2004-09-24 12:10:01 UTC
I forgot to mention that the only changes to run on SLES were 
makefile related, e.g. the different locations of C++ headers.  The 
code itself was unchanged.  This problem is similar to that of 
stopping Lotus Domino except turning off the audit daemon did not 
help us.

Comment 5 Stephen Tweedie 2004-10-04 21:23:14 UTC
The other oopses here are:

kernel BUG at page_alloc.c:242!
invalid operand: 0000
nfs lockd sunrpc audit lp parport autofs4 tg3 e100 ipt_REJECT
ipt_state ip_connt
rack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid
input u
CPU:    3
EIP:    0060:[<c0157d0f>]    Not tainted
EFLAGS: 00010286

EIP is at __free_pages_ok [kernel] 0xef (2.4.21-20.ELsmp/i686)
eax: f6e35300   ebx: c1dc84e8   ecx: 0003ace1   edx: 00000000
esi: f6e35300   edi: b74b4000   ebp: 00000000   esp: f6e0fdec
ds: 0068   es: 0068   ss: 0068
Process ecocComputer (pid: 2698, stackpage=f6e0f000)
Stack: c03a6480 00000002 c03a6364 00000286 fffffffe c03a76dc c1dc8524
c03a6480
       c1d2002c c03a7664 00000286 fffffffe 00000b38 0000008e c1dc84e8
b74b4000
       000000dc c013eaf2 c1dc84e8 0000008e 000000dc c013fbbd f7ba4380
f6e20dd0
Call Trace:   [<c013eaf2>] __free_pte [kernel] 0x52 (0xf6e0fe30)
[<c013fbbd>] zap_page_range [kernel] 0x1ed (0xf6e0fe40)
[<c0146d3a>] exit_mmap [kernel] 0xda (0xf6e0fe94)
[<c0126879>] mmput [kernel] 0x69 (0xf6e0feb8)
[<c012d596>] do_exit [kernel] 0x186 (0xf6e0fec8)
[<c012d92b>] do_group_exit [kernel] 0x8b (0xf6e0fee4)
[<c01372c0>] get_signal_to_deliver [kernel] 0x1f0 (0xf6e0fef8)
[<c010bef4>] do_signal [kernel] 0x64 (0xf6e0ff20)
[<c013c213>] do_futex [kernel] 0xe3 (0xf6e0ff58)
[<f8865e99>] ext3_file_write [ext3] 0x39 (0xf6e0ff74)
[<c013c2e9>] sys_futex [kernel] 0xb9 (0xf6e0ff88)

Code: 0f 0b f2 00 db bb 2b c0 8b 43 14 85 c0 0f 85 6c 02 00 00 b8

Kernel panic: Fatal exception


[root@svdevrhl30b root]# Unable to handle kernel paging request at
virtual addre
ss 00010003
 printing eip:
c01660e3
*pde = 2cbf9001
*pte = 00000000
Oops: 0002
nfs lockd sunrpc lp parport autofs4 audit tg3 e100 ipt_REJECT
ipt_state ip_connt
rack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid
input u
CPU:    1
EIP:    0060:[<c01660e3>]    Not tainted
EFLAGS: 00010206

EIP is at get_unused_buffer_head [kernel] 0x83 (2.4.21-20.ELsmp/i686)
eax: 0000ffff   ebx: 00000000   ecx: 0000ffff   edx: 0000ffff
esi: 00000000   edi: 00001000   ebp: 00000001   esp: eee3fe1c
ds: 0068   es: 0068   ss: 0068
Process ecoagt (pid: 4619, stackpage=eee3f000)
Stack: c3ac2268 000000f0 c01661b8 00000001 00000000 efafa800 00000806
c1dfcf04
       dcd25680 c1dfcf04 c01665d6 c1dfcf04 00001000 00000001 c1dfcf04
efadc300
       c0166c0d c1dfcf04 00000806 00001000 000000f0 00000000 ff0c5000
eee3fe88
Call Trace:   [<c01661b8>] create_buffers [kernel] 0x28 (0xeee3fe24)
[<c01665d6>] create_empty_buffers [kernel] 0x26 (0xeee3fe44)
[<c0166c0d>] __block_prepare_write [kernel] 0x2fd (0xeee3fe5c)
[<f885337b>] new_handle [jbd] 0x4b (0xeee3fe84)
[<c0167479>] block_prepare_write [kernel] 0x39 (0xeee3fea0)
[<f88684e0>] ext3_get_block [ext3] 0x0 (0xeee3feb4)
[<f8868bb9>] ext3_prepare_write [ext3] 0xc9 (0xeee3fec0)
[<f88684e0>] ext3_get_block [ext3] 0x0 (0xeee3fed0)
[<c014b6e3>] do_generic_file_write [kernel] 0x1e3 (0xeee3fef4)
[<c014bc3f>] generic_file_write [kernel] 0x13f (0xeee3ff48)
[<f8865e99>] ext3_file_write [ext3] 0x39 (0xeee3ff74)
[<c01635a7>] sys_write [kernel] 0x97 (0xeee3ff94)

Code: c7 40 04 ff ff ff ff c7 40 2c 00 00 00 00 f0 fe 0d 08 80 3a

Kernel panic: Fatal exception


Unable to handle kernel paging request at virtual address ffffffc8
 printing eip:
c012d9a0
*pde = 00000000
Oops: 0000
nfs lockd sunrpc lp parport autofs4 audit tg3 e100 ipt_REJECT
ipt_state ip_connt
rack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid
input u
CPU:    3
EIP:    0060:[<c012d9a0>]    Not tainted
EFLAGS: 00010246

EIP is at eligible_child [kernel] 0x20 (2.4.21-20.ELsmp/i686)
eax: ffffffff   ebx: ffffff40   ecx: ffffff40   edx: 00000000
esi: f6988000   edi: 00000000   ebp: f69880b8   esp: f6989f40
ds: 0068   es: 0068   ss: 0068
Process sh (pid: 1850, stackpage=f6989000)
Stack: c012de7e ffffffff 00000000 ffffff40 00010206 00000000 00000001
00000000
       f6988000 00000000 00000000 00000000 00000000 04000000 0013eeb8
00000000
       f6988000 f6988170 f6988170 00000000 08074020 04000000 0013eeb8
f6988000
Call Trace:   [<c012de7e>] sys_wait4 [kernel] 0xde (0xf6989f40)
[<c012e0b7>] sys_waitpid [kernel] 0x27 (0xf6989fac)

Code: 8b 81 88 00 00 00 83 f8 ff 74 5c 85 d2 79 51 83 f8 11 74 3c

Kernel panic: Fatal exception

Comment 6 Stephen Tweedie 2004-10-04 21:28:26 UTC
You also refer to "the aacraid panic" but that's not described
anywhere: do you have information about that panic too?

The oopses above show little except for vague evidence of random
memory corruption.  There's almost no other common element in them. 
We'd really need something more concrete to pursue this --- a
reproducer that we can run ourselves, for example, or a crash dump, or
a much better description of what precise behaviour triggers the problem.


Comment 7 John Schmidt 2004-10-04 21:48:00 UTC
The aacraid-induced panic is 131703.  I followed the steps to work
around it as described in that bug.  I wish I knew what (if anything)
we can control (code-wise) to prevent these panics.  It seems 
something in the ServerVantage startup process (minimum 3 processes) 
is exposing|causing these panics.  It occurs sooner running an SMP 
kernel.  It does not occur on RH 9.0 or SLES 9 (SLES 9 required 
recompiling the same code).  I'm working on 2 tasks now; 1- make 
xconfig and load ../arch/i386/defconfig and change ONLY 
CONFIG_IKCONFIG to y, recompile, etc. and run that kernel (RHEL3 
update 2), 2- Download the RHEL4 beta and try running SV there.  

I can provide a crash dump provided I get explicit instructions on 
going about capturing it.

Comment 8 Stephen Tweedie 2004-10-11 22:58:05 UTC
RHEL3 supports network dumps, but not disk dumps.  There's a whitepaper at

http://www.redhat.com/support/wpapers/redhat/netdump/index.html


Comment 9 John Schmidt 2004-10-13 20:58:30 UTC
OK after 4 tries I finally captured both log and vmcore on my netdump-
server VMware image.  The client VMware image is config'd with 256M 
so how do I get it to you?  gzipped it's down to 72M+ and Bugzilla 
won't let me attach it, doubt we have a publicly available URL you 
could get it from... email, ftp?!

Comment 10 John Schmidt 2004-10-13 21:06:04 UTC
Created attachment 105163 [details]
netdump log file.

Here's the log file anyway...

Comment 11 John Schmidt 2004-10-13 22:03:22 UTC
Moved 2 TAR+gzipped files to:

ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdu
mp1_files.TAR

ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdu
mp2_files.TAR

Each contains log and vmcore files.  The 2nd panic occurred on the 
automatic reboot when logging in as root using KDE, process in the 
Oops is kdeinit.

Comment 12 Stephen Tweedie 2004-10-14 11:14:03 UTC
I cannot access those files:

$ lftp ftp://ftp.compuware.com
lftp ftp.compuware.com:~> cd pub/vantage/server/outgoing/
cd: Access failed: 550 /pub/vantage/server/outgoing: Access is denied.



Comment 13 John Schmidt 2004-10-14 12:35:45 UTC
If you use the full URL including netdump1_files.TAR in the browser 
Address it should prompt you to Open, Save, etc.

Comment 14 John Schmidt 2004-10-18 13:15:24 UTC
Added another set of vmcore & log files at (enter full URL to 
download):
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdu
mp3_files.TAR.

Comment 15 John Schmidt 2004-10-28 15:50:30 UTC
Tried raising priority to High but I guess the "Reporter" is not the 
same as the "submitter" so it wouldn't take.  Now getting panics 
simply running the tar command during ServerVantage (9.7) 
installation before ANY of our code has a chance to run.  Unable to 
recreate (using restart_ecoagt processing) on RHEL 4 Beta.

Comment 23 John Schmidt 2004-12-09 19:03:19 UTC
I don't have access to RHN Errata (up2date) to download individual 
kernel packages as we have only 5 machines configured for up2date.  I 
can request one of the 5 IT machines be up2date'd or if "update 4" is 
available in ISO images for download, install them on my VMware image.
I checked Easy ISOs on RHN and only see Update 3.

Comment 24 Tom "spot" Callaway 2004-12-10 02:21:09 UTC
John, you should be able to login to https://rhn.redhat.com, then go to:
https://rhn.redhat.com/network/software/channels/downloads.pxt?cid=1186

That link should take you to the Update 4 Beta ISO images. Update 4
has not been released yet in final form.

However, I think that we would like you to try and test with a kernel
that has slab-debug enabled. Watch this space, we'll post a link.

Comment 27 Dave Anderson 2004-12-10 18:02:48 UTC
In this location: http://people.redhat.com/anderson/.BZ_132838

there is a U4 kernel with CONFIG_DEBUG_SLAB turned on.
It's a UP kernel, which is more useful as far as slab
debugging is concerned because the UP kernel doesn't use 
per-cpu slab object caches, which aren't affected by the
slab debug code.

The directory contains four files:

kernel-2.4.21-27.slab_debug1.EL.i686.rpm
kernel-debuginfo-2.4.21-27.slab_debug1.EL.i686.rpm
vmlinux-2.4.21-27.slab_debug1.EL
vmlinux-2.4.21-27.slab_debug1.EL.debug

but only kernel-2.4.21-27.slab_debug1.EL.i686.rpm needs to be
downloaded, installed, and rebooted:

 $ rpm -ivh kernel-2.4.21-27.slab_debug1.EL.i686.rpm

Please ensure that netdump is still enabled, and then try to get
us a netdump or two.

The other files in the directory consist of the kernel debuginfo
package, and for convenience sake only, the vmlinux and vmlinux.debug
files extracted from the two binary RPMs.  These will only be of use
for subsequent analysis of any dumpfiles.

In any case, the slab debug code is not a panacea for all slab
corruption problems, but hopefully help trap the problem at hand
at an earlier stage.

FWIW, I'll also attach my notes re: the slab corruption in the
first two dumps to this BZ for future reference if necessary.




Comment 28 Dave Anderson 2004-12-10 18:05:54 UTC
Created attachment 108334 [details]
crash analysis notes for netdump #1

Comment 29 Dave Anderson 2004-12-10 18:07:41 UTC
Created attachment 108336 [details]
crash analysis notes for netdump #2

Comment 30 John Schmidt 2004-12-13 14:37:34 UTC
Installed U4 beta for grins, panic still happened.  Applied slab 
debug kernel and captured vmcore and log.  Please download from:
ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdu
mp_debug.TAR.gz

Remember to enter the full URL in your browser to download.

Comment 31 John Schmidt 2004-12-13 15:10:40 UTC
Kernel panicked again when logging in using KDE after the auto-
reboot, the second time this happened.  The ServerVantage init.d 
script and rc*.d symlinks had been removed so we were not involved.  
The running process was artsd.

ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdu
mp2_debug.TAR.gz

Comment 32 Dave Anderson 2004-12-14 20:10:32 UTC
Created attachment 108560 [details]
crash analysis notes for 
vmlinux-2.4.21-21.EL kernel

Comment 33 Dave Anderson 2004-12-14 20:13:13 UTC
Created attachment 108562 [details]
crash analysis notes for slab-debug kernel

Comment 34 Dave Anderson 2004-12-14 20:29:03 UTC
The 2.4.21-21.EL and the 2.4.21-27.slab_debug1.EL kernel panics, 
like the previous two, seemingly have no relationship in their
end results -- other than the fact that all 4 dumpfiles show
corruption in the size-2048 slab at a minimum, and in some cases,
more than that particular slab.  

The slab-debug kernel's protection mechanism obviously didn't
catch anything in the act of a double free, which would seemingly
have been the case, since all the dumps have size-2048 slab
chains (the partial and full) intermingled.  It's not clear to me
how else they could get into that state without being "caught"
by the slab debug code.

What I'm wondering now is exactly *when* does this corruption occur.

If a "cat /proc/slabinfo" were to be done with the size-2048 slab
chains in the state seen in the dumpfiles, one of the following
3 BUG()'s would panic the system:

        list_for_each(q,&cachep->slabs_full) {
                slabp = list_entry(q, slab_t, list);
                if (slabp->inuse != cachep->num)
                        BUG();
                active_objs += cachep->num;
                active_slabs++;
        }
        list_for_each(q,&cachep->slabs_partial) {
                slabp = list_entry(q, slab_t, list);
                if (slabp->inuse == cachep->num || !slabp->inuse)
                        BUG();
                active_objs += slabp->inuse;
                active_slabs++;
        }
        list_for_each(q,&cachep->slabs_free) {
                slabp = list_entry(q, slab_t, list);
                if (slabp->inuse)
                        BUG();
                num_slabs++;
        }

Can you simply boot the system, don't run anything, but just go
into a virtual terminal window or serial console preferably,
and enter "cat /proc/slabinfo"?

And then perhaps try a few things that have preceded the previous
crashes, and do another cat of /proc/slabinfo.  There must be some
act that causes the corruption, and perhaps it can be tracked down
in this manner.

Other than that, perhaps another run with the slab-debug kernel
would produce a netdump that would give us more clues.



Comment 35 John Schmidt 2004-12-14 20:57:30 UTC
I forgot to update that the 2nd crash-dump on the auto reboot had 
picked up the default U4 kernel.  I've updated the grub/menu.lst to 
default to the slab_debug kernel.  I'll add the cat of /proc/slabinfo 
to my start_loop script so it can be captured between calls to 
restart_ecoagt script which fires up the SV agent processes.

Comment 36 John Schmidt 2004-12-15 15:48:02 UTC
Added another crash dump to the CW FTP site:

ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/netdu
mp3_debug.TAR.gz

I added slabinfo.log as well, which was the cat /proc/slabinfo grep'd 
for "size-2048" which didn't look unusual just before it crashed.

Comment 37 John Schmidt 2005-01-05 21:43:49 UTC
How about we provide you our Linux Agent binaries so that you can run 
it on your system(s) using whatever debug kernel you require?  The 
Control Server component which runs on Windows is not required to 
recreate the panics we're exposing.

Comment 38 Dave Anderson 2005-01-05 21:59:48 UTC
Created attachment 109403 [details]
notes for last dumpfile sent

Comment 39 Dave Anderson 2005-01-05 22:03:23 UTC
FWIW, I attached the notes for the last dumpfile, finding essentially
the same problem -- two corrupt slab caches, the size-2048 as has been
the case in all prior dumps, as well as size-32 cache as well. 
But the dump trace is unrelated to the others, probably due to
previous slab cache corruption.

In any case, absolutely, if you can give us a reproducer, it would
be to both of our benefits!  Tell us what to do!

Thanks, 
  Dave

Comment 40 John Schmidt 2005-01-07 00:29:49 UTC
Added linux_ia32.TAR.gz to the Compuware FTP website, point your 
browser at the following URL to download it:

ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/linux
_ia32.TAR.gz

Once you have it downloaded to /tmp on a test system running RHEL 3 
U4, uncompress it.  Then extract the install file to install the 
linux_ia32.TAR file, e.g.

./install

which will walk you through an install.  Any ?? email call me at 313-
227-6779.


Comment 41 John Schmidt 2005-01-07 00:39:16 UTC
I forgot... to recreate the problem there's another script in 
linux_ia32.TAR.gz that is normally not shipped, start_loop.  It 
invokes /usr/ecotools/bin/restart_ecoagt every 5 seconds.  To speed 
things up between restarts you can reduce the sleep in start_loop and 
the sleeps in stop_agt...

Comment 42 Dave Anderson 2005-01-07 15:06:34 UTC
Got this far -- but don't know what to do re: "control server" info:

Select one of the following options:
  1) Unload agent station software
  2) Install agent station software
  3) Configure monitoring of databases/applications
  4) Transfer agent station software to remote agents
  Q) Quit installation

Enter your selection [Q] => 2

Checking for NPTL.
Threads: NPTL 0.45
Found Native POSIX Threads Library!



Enter the full pathname of the directory to store temporary files
[/usr/tmp/ServerVantage/tmp] =>

Enter the full pathname of the directory to store log files
[/usr/tmp/ServerVantage/tmp] =>


Enter the full pathname of the directory to store data files
[/usr/tmp/ServerVantage/datafiles] =>

During the install process, please specify the Control Server by hostname
or by IP address. Using the IP option is best for complex networks,
including
firewalls, multiple local NICs, or lack of DNS resolution.


Does a firewall exist between this agent station and the control
server (y/n) [n] =>

The current control server configuration is:
  CS Hostname               : No current value
  CS IP address             : No current value
  Event port                : No current value
  TCP port                  : No current value
  Ecoagt RPC min port       : No current value
  Ecoagt RPC num ports      : No current value
  Ecoagt Localhost IP       : No current value
  Ecoagt Localhost Alias    : No current value

Enter the host name of the control server =>

Comment 43 John Schmidt 2005-01-07 15:39:48 UTC
You can use any hostname or IP address that is listed in /etc/hosts 
but do not test the CS connection which follows, just enter n for 
that test.

Comment 44 John Schmidt 2005-01-10 18:56:06 UTC
Have you been able to get the SV Linux agent installed successfully?  
Once the Control Server is entered the other config parms can be 
defaulted.

Comment 45 Dave Anderson 2005-01-10 20:10:40 UTC
Sorry John -- I haven't been able to get the change to retry it.
I'm working on another reproducible slab cache corruption issue that 
may be related to this one; if it turns out not to be, I'll get back
on this one as soon as I can.

Comment 46 Dave Anderson 2005-01-10 21:33:47 UTC
As it turns out, after revisiting the original dumps from this case,
the error signature is the same as with the other case I'm currently
working on.  The size-2048 cache is not being corrupted in my other
case, but the data structure corruption seen shows exactly the same
thing as with this case.  I didn't realize it at the time, but in
this case and in my other reproducer, data structures from the slab
cache are being over-written by a piece of an active task_struct. 
We're trying various strategies to "catch it in the act", but since
it's really not a case of a double-free, or other slab cache
mishandling, the slab-debug code is not catching it.

In any case, I may come back to using ServerVantage as the reproducer,
but I just wanted to let you know that we're attacking this issue
with the highest priority.



Comment 47 John Schmidt 2005-01-10 21:43:07 UTC
Good news I guess... will wait to hear from you.

Comment 48 Jeff Burke 2005-01-11 13:46:11 UTC
Created attachment 109604 [details]
Installation log file from installing ServerVantage

This was one of the errors in the install.log
  /opt/ServerVantage/bin/ecoconfig: error while loading shared libraries:  
libeco_core.so: cannot open shared object file: No such file or directory

The installation continued but the ecoagt would never start.

Please verify the installation.log file. Let me know if you need any additional
information.

Comment 49 Jeff Burke 2005-01-11 14:14:41 UTC
Created attachment 109607 [details]
Result from start_loop script

After manually setting the LD_LIBRARY_PATH variable to what is in the
/etc/init.d/ServerVantage LD_LIBRARY_PATH. We started the loop test. The log
file has the output of the test script.

Do we need to have java installed fo rhtis application to work properly?

Comment 50 John Schmidt 2005-01-11 14:28:04 UTC
Hmmmm... restart_ecoagt should be setting LD_LIBRARY_PATH, 
ECOBOOTSTRAP, and ECOHOME.  There should be a logerror.ecoagt.<pid> 
in /opt/ServerVantage/tmp.  I think the problem is in the 9.1 agents 
we look for a FlexLM license on the agents.  Let me get you a 9.7 
agent that doesn't require licensing checks on the agents, it'd done 
on the Windows CS....  I'll post another URL for the 9.7 agent once I 
get it on the FTP site.

Comment 51 John Schmidt 2005-01-11 14:33:36 UTC
OK the 9.7 agent is here:

ftp://ftp.compuware.com/pub/vantage/server/outgoing/rhel3_132838/linux
_ia32.TAR.Z

Sorry about that...

Comment 52 Jeff Burke 2005-01-11 14:57:21 UTC
John,
    I did an unistall of the previous version. That removed all teh directories
and files.  When I opened the new tar file the install script is missing also
your  start_loop script is not in the teh tar file.

    Did you want me to install of this version right on top of the other version
with out doing an unistall?

Jeff

Comment 53 John Schmidt 2005-01-11 15:01:07 UTC
You can install over the old directories, just do 1) Unload.  can you 
still extract the start_loop script from the original TAR file?

Comment 54 Dave Anderson 2005-01-11 21:11:39 UTC
The test ran just fine this time.  Unfortunately it did so about
500 times in a row without failing, so we're going back to our other
test scenario that leads to the same result.  Thanks, anyway...

We'll keep this case posted when we come up with something.

Comment 55 John Schmidt 2005-01-11 21:15:35 UTC
It happens sooner on an SMP kernel.  On a single CPU box I've seen it 
go for 2-3K restarts before it panics.  Be patient, if you run it it 
will panic.  Also seems to happen pretty quick on a Dell PowerEdge 
2650 running SMP kernel.

Comment 56 Dave Anderson 2005-01-11 21:31:51 UTC
We were running it on a SMP kernel running on a UP box.  We just
thought we'd get it to happen quicker with this test than the
one we'd been using, which typically takes an hour or so.
It really doesn't make much difference what test we use, and
we can make it happen fairly quickly with our test.

Comment 57 Stephen Karniotis 2005-01-13 20:15:53 UTC
Dave:  

   Our VPs are asking for a status on the testing this bug.  Can you 
please provide an update?

Comment 58 Dave Anderson 2005-01-13 20:28:43 UTC
Sure.  We know what the "error signature" is, but have not yet come
up with a way to catch it:  The last 496 bytes of a task_struct (in
our test case, that of the currently-running task), are being errantly
copied to the beginning of a slab cache page.  The problem is figuring
out when and how it happens; by the time we bump into it and the
system crashes, it's well past the time of corruption.  We're adding
debug code to various data-move routines, checking for source
addresses (i.e., in the current task), and for destination addresses
that are in the typically-targeted slab caches (in our tests, they
predictably end up corrupting a page in the inode cache, the dentry
cache or the size-128 cache).  All I can say is that it's getting
full-time attention, with no resolution as of yet.

   

Comment 59 Jeff Burke 2005-01-18 13:51:09 UTC
John,
    In the original post this issue was being seen on a Dell PowerEdge
2650. Could you please provide a little more information on that
hardware configuration? 

    Specifically I am looking for the memory configuration. But the
more data the better.  If the system has not been modified from the
factory ship, I can get all the information I need if you pass along
the Dell Service Tag.

Thanks in Advance




Comment 60 John Schmidt 2005-01-18 14:05:27 UTC
Jeff,

The "tag" is JLDRY11 but this also happens in a VMware image on a 
Dell OptiPlex GX260, also IBM Zseries at customer sites among others.


Comment 61 Jeff Burke 2005-01-21 12:53:26 UTC
John and Stephen,
 The following test kernels (plus the kernel-source RPM) are available
under this Red Hat people page:

http://people.redhat.com/~petrides/.kcore/kernel-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-smp-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-hugemem-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-source-2.4.21-27.0.2.EL.ernie.kcore.1.i386.rpm

Please let us know whether this proposed fix resolves the data
corruption problem you've encountered.  If you need a different RPM,
just list that here and one of us will make it available to you.

Thanks in advance.

Comment 62 John Schmidt 2005-01-21 15:20:16 UTC
Thanks Jeff... I will install it and give it a try and probably get 
back to you Monday unless it panics today... hopefully it doesn't 
panic and runs the w/e!

Comment 63 Jeff Burke 2005-01-25 13:38:06 UTC
John,
   Just following up. All my test ran succesfully over the weekend.
Any word on how your testing is going?

Thanks Jeff

Comment 64 John Schmidt 2005-01-25 16:32:52 UTC
Jeff,

So far the kernel patch looks good.  I've hit 9000+ restarts (of 
ecoagt) without seeing any panics.  Since we've seen it rather 
quickly and predictably on a dual-CPU machine I'd like to test it 
there as well.  I'm currently waiting for access to a dual-CPU 
machine.

Regards,
John

Comment 65 Dave Anderson 2005-01-25 18:54:03 UTC
John,

Just to clarify what the problem is, and why we're confident that
it addresses your problem.

In your libeco_compsys.so.9.1 library, there is this:

$ strings libeco_compsys.so.9.1 | grep kcore
file /proc/kcore | awk '{print $3}' | awk -F- '{print $1}' 2>/dev/null
$

/proc/kcore is read to determine whether the kernel is 32-bit.
Your install script does the same thing to determine "OS_BITS",
although it would only do it one time...

When the pseudo-file /proc/kcore is accessed, the kernel dynamically
creates a "fake" ELF header, which is then copied out to the user
space program (the "file" command in your case).  It typically only
needs 1 page to create the ELF header, and so a single page is
allocated in the kernel.  However, there are circumstances when,
depending upon the number of vmalloc() calls that have been made
during your kernel's run-time, where it needs more than the 1 page
that was allocated.  The over-run consists of the tail-end of a copy
of the currently-running task's task_struct, which flows into the
beginning of the next page after the allocated one.  Typically (but
not necessarily) this tends to be slab cache page, and the corruption
is usually not encountered until a much later point in time.
 
We were re-creating the problem by simply doing a tar of /proc, which
also pre-determines the type of file /proc/kcore is before reading it.
 That was far more deterministic in the re-creation of the problem,
since the tar process was consuming most of memory with dentry and
inode slab cache data.



In any case, the fix was to correctly calculate the size of the
ELF header.




Comment 66 John Schmidt 2005-01-26 14:25:42 UTC
Dave, Jeff,

It looks like the panic situation starting SV has been resolved for I 
still have not encountered a panic on a single CPU image.  Before we 
inform our customers it is safe to run us in a production environment 
we're still going to test on a dual-CPU machine.  Thank you and the 
others involved for your help.

Regards,
John

Comment 67 Stephen Karniotis 2005-01-26 14:28:55 UTC
Good morning all:

   Before we close out this bug, I request to allow John to complete 
all testing in multi-CPU environments.  I want to make sure our 
customers are covered and can be assured that the product is safe to 
run in their production environments.  Otherwise, we will be back 
with this issue again.

Comment 68 John Schmidt 2005-01-27 16:52:45 UTC
It looks to be running good on the dual-CPU as well, 3600+ restarts 
without a panic.  The "official" kernel patch that is being tested by 
RH QA, will it be the same as what I've tested with?

Regards,
John

Comment 69 Ernie Petrides 2005-01-28 23:59:03 UTC
Hi, John.  The official patch that I'm going to commit to U5 tonight
has had one minor improvement for maintainability (which was suggested
during code review).  However, the functionality will match exactly
what you're already testing.


Comment 70 Ernie Petrides 2005-01-29 06:10:01 UTC
A fix for this problem has just been committed to the RHEL3 U5
patch pool this evening (in kernel version 2.4.21-27.10.EL).


Comment 71 John Schmidt 2005-01-31 13:11:42 UTC
Hi Ernie,

Will there be a separate kernel patch or will it be available only in 
U5?

-John

Comment 72 Ernie Petrides 2005-02-01 00:51:39 UTC
Created attachment 110472 [details]
/proc/kcore fix committed to RHEL3 U5

Hi, John.  I've attached the exact patch that was committed to U5
last week.  You've been testing with a patch that is only slightly
different in the 1st patch hunk, but the functionality is the same.

We don't currently have plans to release a pre-U5 erratum with this
fix, although our support organization might consider making a U5
"Hot Fix" kernel (based on my interim U5 build Friday) available to
select customers after it's had a little Q/A.  (Hot Fix kernels are
snapshots of the next Update-in-progress, and thus include everything
we've committed to U5 so far -- about 80 fixes at this point.)

Comment 73 John Schmidt 2005-02-01 13:56:24 UTC
Hi Ernie,

Do you have an ETA for Update 5?  I'm considering the removal of the 
file command against /proc/kcore from our code and install script 
since we can just default to 32bit.  However we have to run tar to 
extract ourselves so we'd still expose our customers to this problem 
until U5 is released.

Regards,
John

Comment 74 Ernie Petrides 2005-02-01 23:59:36 UTC
John, U5 beta is currently scheduled to start mid-March, and
final release is currently scheduled for beginning of May.


Comment 75 Stephen Karniotis 2005-02-08 20:40:14 UTC
Ernie:  This is Stephen Karniotis at Compuware.  We have identified another 7+ 
mutual customers that have incurred this problem with expectations of more as 
they deploy our Linux Agent.  We need to get this patch in their hands before 
May.  We would prefer a Hot Fix for this if possible.  We are also open to 
getting an agreement created from Red Hat to allow us to distribute this to 
our Premier Customers and allow them to download from either our site or 
yours.  Please discuss with Bret Hunter in the Alliance Organization as well 
as your management and have someone call me to discuss.  My direct number is 
(313) 227-4350; wireless is (248) 408-2918.  Need a resolution very soon.

Comment 76 Ernie Petrides 2005-02-09 01:18:33 UTC
Hello, Stephen.  I'm just a lowly engineer (and RHEL3 kernel pool
maintainer).  All I can tell you is that a RHEL3 pre-U5 kernel with
this fix has already been built, and it is a viable candidate for a
"Hot Fix" kernel that could be provided by our Customer Support
organization.  Since Bugzilla is simply a bug tracking tool, I'd
recommend that you engage Customer Support directly (indicating
that the fix you want is in kernel version 2.4.21-27.10.EL or later).


Comment 78 Michael Waite 2005-04-13 15:55:54 UTC
I am attached to this issue from the pertner team.
I am waiting for Stephen Karniotis to get back to me.
-----Mike


Comment 79 Dave Anderson 2005-04-13 17:03:15 UTC
This issue was marked MODIFIED by Ernie Petrides as part of
his errata tracking procedure when he puts the fix into
the RHEL3 source tree.  Why was it put back to ASSIGNED state?

Comment 80 Ernie Petrides 2005-04-13 19:11:01 UTC
Reverting to MODIFIED state until MikeW provides evidence that
this problem has not been fixed in RHEL3 U5 beta.

Comment 81 Michael Waite 2005-04-13 19:52:02 UTC
I have no idea what that means.
Right now, Compuware is eagerly awaiting the update release that contains the fix.
Their customer base is all operating on hacked up workarounds right now until we
can get them the drop with the fix.


Comment 83 Dave Anderson 2005-04-13 20:30:38 UTC
The "Bug Activity" shows that you (Mike) changed the status from
MODIFIED to ASSIGNED at the same time you made your first post.
I don't know whether you did that manually, or if it got switched
automatically somehow?



Comment 84 Tim Powers 2005-05-18 13:28:08 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html



Note You need to log in before you can comment on or make changes to this bug.