Bug 108481 - LTC5196-Kernel panic under moderate load
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386 Linux
Priority: high    Severity: high
Assigned To: Stephen Tweedie
QA Contact: Mike McLean
Blocks: 107562
 
Reported: 2003-10-29 18:08 EST by IBM Bug Proxy
Modified: 2007-11-30 17:06 EST
CC List: 6 users

Doc Type: Bug Fix
Last Closed: 2003-12-18 15:01:20 EST

Attachments
log of kernel crash via netdump (310.33 KB, text/plain) - 2003-11-11 16:40 EST, IBM Bug Proxy
Log file from Kenbo (141.25 KB, text/plain) - 2003-11-14 10:20 EST, mark wisner
log from from IBM (141.25 KB, text/plain) - 2003-11-14 10:22 EST, Bob Johnson
netdump log (141.25 KB, text/plain) - 2003-11-15 15:57 EST, IBM Bug Proxy
log.newkernel (141.25 KB, text/plain) - 2003-11-15 16:07 EST, IBM Bug Proxy
info.tar (50.00 KB, application/x-tar) - 2003-12-09 17:27 EST, IBM Bug Proxy
netdump log sct kernel (130.07 KB, text/plain) - 2003-12-10 09:12 EST, IBM Bug Proxy

Description IBM Bug Proxy 2003-10-29 18:08:29 EST
The following has been reported by IBM LTC:  
Kernel panic under moderate load
Hardware Environment:

Intel 4-way 2.8Ghz, 4GB Ram, 2Gb Swap

Software Environment:
RHAS 3.0 RC3 latest

Steps to Reproduce:
1. Install/setup Domino
2. Run testnsf load against server of 1000 users
3.

Actual Results:
Kernel panics after some period of time

Expected Results:
Should run indefinitely

Additional Information:

In /var/log/messages, last line before panic was:

Oct 28 13:17:33 rhas30 kernel: VFS: brelse: Trying to free free buffer

Last output on test was done at Oct 28 13:18:51, according to output in nohup file.

This is the panic which appeared on the console (part that was visible)

put usb-ohci usbcore ext3 jbd aic79xx sd_mod scsi_mod
CPU:	3
EIP	0060:[<c01622e4>]	Not tainted
EFLAGS:	00010246

EIP is at __remove_from_queues [kernel] 0x14 (2.4.21-4.0.1.EL.db2testsmp)
eax: 00000000	ebx: defda82c	ecx: defda82c	edx: 0000be30
esi: defda82cd	edi: defda82c	ebp: c1c4be30	esp: c65fdf54
ds: 0068	es: 0068	ss: 0068
Process kswapd (pid: 19, stackpage=c65fd000)
Stack:	c01659ae defda82c c1c4 be30 c1c4be4c defda82c c03a5000 c0151e3b c1c4be30
	000001d0 c1c4be44 00000001 000032e1 c03a5000 c03a61c0 00000035 c0153320
	c03a5000 000001d0 c1c4be30 c03a5000 00000040 000001d0 00002656 c0153797
Call Trace:	[<c01659ae>] try_to_free_buffers [kernel] 0x8e (0xc65fdf54)
[<c0151e3b>] launder_page [kernel] 0x86b (0xc65fdf6c)
[<c0153320>] rebalance_dirty_zone [kernel] 0xa0 (0xc65fdf90)
[<c0153797>] do_try_to_free_pages_kswapd [kernel] 0x187 (0xc65fdfb0)
[<c01538b8>] kswapd [kernel] 0x68 (0xc65fdfd0)
[<c0153850>] kswapd [kernel] 0x0 (0xc65fdfe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc65fdff0)

Code: 89 02 c7 41 34 00 00 00 00 89 4c 24 04 e9 7a ff ff ff 8d 76

Kernel panic: Fatal exception

Kenbo - can you try the GA version of RHEL3 and see if you still see the panic?

Kenbo, did you use one of the kernels provided by Red Hat to fix LTC
https://bugzilla.linux.ibm.com/show_bug.cgi?id=4829 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=106794 
reported by DB2?
Comment 1 IBM Bug Proxy 2003-10-30 10:19:53 EST
------ Additional Comments From kenbo@us.ibm.com  2003-30-10 10:08 -------
The first time the system "crashed" was with the standard kernel - I was 
unable to get any info as the screen was empty on the system.  I ran my test 
again, but after 18 hours the system was still good.  Thinking this might be 
related to the DB2 issue, I then installed the DB2test smp kernel and began my 
testing anew.  To help make the issue show, I added a 1000 user test load to 
my existing test - this never really stressed the machine, other than perhaps 
the disk subsystem.  During testing, I crashed the system and got the panic 
shown here.  I then restarted the testing again, and overnight the system 
crashed but this time I was again unable to see the kernel panic on the 
display.  I am presently trying to revert to the standard smp kernel.  
However, it should be noted that the panic is not anywhere near the code which 
was changed for DB2 - that code related to the IPC system and my panic is with 
the kernel swapd. 
Comment 5 Rik van Riel 2003-10-30 16:02:05 EST
I can't find a direct way to get from the printk() back to page_launder(), this
seems more subtle than just reading the code.  Stephen?  Arjan?  Al?  Would you
happen to know something that could cause this bug?

(I'm reading the code more now, to see if I can find any bug)
Comment 6 Rik van Riel 2003-10-30 16:20:28 EST
OK, Stephen and I have discussed the bug and we will prepare a kernel with ext3
debugging enabled and a BUG() line after the 'printk(KERN_ERR "VFS: brelse:
Trying to free free buffer\n");', so we can find out who is double-freeing a buffer.
Comment 7 IBM Bug Proxy 2003-10-31 10:51:46 EST
------ Additional Comments From kenbo@us.ibm.com  2003-31-10 10:47 -------
Ok, I reverted back to the standard RHAS 3.0 kernel-smp and restarted my 
tests.  The testing ran for ~4 hours before the system hung.  I do not know if 
it was a kernel panic or a hang, as none of my windows displayed any 
information, including the console login on the machine itself which was blank 
and unresponsive.  So, any ideas on what I can do to try to help and debug 
this issue?  The test had only ramped up to ~800 users when the failure 
occurred - it ramps up very slowly.  This inability to run such a small load 
will block us from being able to support RHAS 3.0 - with this machine I would 
expect to be able to support upwards of 4000 users in a single partition. 
Comment 9 Stephen Tweedie 2003-11-03 10:39:05 EST
Could you please try setting up either a serial console or netconsole
so that we can trap kernel messages when the fault occurs?

I'm going to get a kernel built through our build system with the
"buffer already freed" error turned into a BUG(), so that we can
capture more information when it recurs.
Comment 10 IBM Bug Proxy 2003-11-03 14:11:03 EST
------ Additional Comments From kenbo@us.ibm.com  2003-03-11 14:06 -------
I'm trying to get ahold of a serial cable.  In the meantime, would anyone 
happen to have a link to somewhere showing how I can use netconsole and I will 
try that as well?  Thanks! 
Comment 11 Stephen Tweedie 2003-11-03 14:57:06 EST
There are docs in the "netdump" (client-side) and "netdump-server"
(server-side) packages on RHEL3.  Let me know if you need any help
getting it set up.
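
For reference, a minimal netdump setup sketch, assuming the stock RHEL3 netdump
and netdump-server packages (the server address below is a placeholder; the exact
steps should be checked against the package docs mentioned above):

    # on the crashing (client) machine -- /etc/sysconfig/netdump
    NETDUMPADDR=192.168.0.10        # placeholder: IP of the netdump-server box

    # still on the client: push the client key to the server, then start the service
    service netdump propagate
    chkconfig netdump on
    service netdump start

    # on the receiving (server) machine
    chkconfig netdump-server on
    service netdump-server start    # dumps should land under /var/crash/<client-ip>-<date>/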
Comment 12 Tim Burke 2003-11-11 10:59:01 EST
Setting state to needinfo. Stephen is working on a debug kernel for
IBM to try.  Currently, given the lack of concrete info on this one,
its prospects are looking bleak for resolution in the U1 timeframe.
Comment 13 IBM Bug Proxy 2003-11-11 12:13:05 EST
------ Additional Comments From kenbo@us.ibm.com  2003-11-11 12:06 -------
I think I have netdump setup correctly now and I'm restarting my tests to see 
if I can capture anything with it.  Doing this required 2 RHAS 3.0 systems. 
Comment 14 IBM Bug Proxy 2003-11-11 16:40:00 EST
Created attachment 95912 [details]
log of kernel crash via netdump
Comment 15 IBM Bug Proxy 2003-11-11 16:40:19 EST
------ Additional Comments From kenbo@us.ibm.com  2003-11-11 15:51 -------
 
Log of kernel crash via netdump

Ok, the system crashed and appeared to try and do a netdump but it was
incomplete.  On my system running the netdump-server, in /var/crash I have a
directory named with my netdump client machines ip and date of the crash. 
Inside of it I have a log file and a vmcore-incomplete, but the vmcore file is
zero length.  I'm including the log file.  Thanks! 
Comment 16 Arjan van de Ven 2003-11-11 16:42:33 EST
the m

Assertion failure in __journal_file_buffer() at transaction.c:2015:
"jh->b_jlist < 9"
------------[ cut here ]------------
kernel BUG at transaction.c:2015!


Comment 17 Stephen Tweedie 2003-11-11 18:18:36 EST
That last assertion failure is consistent with the previous ones:
we've got a freed-but-in-use buffer_head, the fields in the bh have
been reused for something else now that they are free, and we're
seeing garbage here as a result.  

Were there any instances of the "Trying to free free buffer" message
in the log this time?

A kernel build has just completed here with that warning turned into a
BUG(), but it looks as if we'll need more significant debugging to
catch this.  I'll give the kernel a quick test then post its location
here.
Comment 18 IBM Bug Proxy 2003-11-13 10:51:50 EST
------ Additional Comments From kenbo@us.ibm.com  2003-13-11 10:48 -------
I misunderstood what was being asked with "Were there any instances of 
the "Trying to free free buffer" message in the log this time?".  No, I looked 
in /var/log/messages and there is no line this time.  Thanks! 
Comment 19 Stephen Tweedie 2003-11-13 12:02:35 EST
http://people.redhat.com/sct/.private/test-kernels/2.4.21-4.1.sct1/

has the test kernel with the appropriate VFS BUG() added to trap the
double-free.  If you could run that with netdump dumping enabled that
would be a great help.

The kernel has had only basic testing, but it's a one-line change from
the 4.EL gold kernel.
Comment 22 mark wisner 2003-11-14 10:20:34 EST
Created attachment 95968 [details]
Log file from Kenbo

netdump log on crash after changing to test kernel

I am still not getting a vmcore-incomplete file (zero length), but here is the
netdump log for crash which happened overnight under load.  I am running the
2.4.21-4.1.sct1smp kernel.  Restarting testing again.  Thanks!
Comment 23 Bob Johnson 2003-11-14 10:22:42 EST
Created attachment 95969 [details]
log from from IBM
Comment 24 Bob Johnson 2003-11-14 10:24:04 EST
the log file I attached is a dupe of the one Mark just posted, sorry
Comment 26 Stephen Tweedie 2003-11-14 10:40:16 EST
Thanks.  Looks like this one doesn't add a whole lot that's new,
though: it didn't trigger the booby trap in the test kernel.  It's the
second log we've seen with rebalance_dirty_zone(), though.  Right now
I can't tell if there's a problem there, or if we're just being
unlucky and that's where we trip up on corruption that has happened
already elsewhere.

Do you have any other logs?  For this specific problem, I expect that
having as many instances as possible to examine may probabilistically
narrow down the real problem here.
Comment 27 Stephen Tweedie 2003-11-14 10:41:22 EST
By the way, are you running these tests with ext2 or ext3?  Could you
try reproducing with the other fs?

Thanks.
Comment 28 IBM Bug Proxy 2003-11-15 15:57:23 EST
Created attachment 96000 [details]
netdump log
Comment 29 IBM Bug Proxy 2003-11-15 15:57:59 EST
------ Additional Comments From kenbo@us.ibm.com  2003-14-11 09:02 -------
 
netdump log on crash after changing to test kernel

I am still not getting a vmcore-incomplete file (zero length), but here is the
netdump log for crash which happened overnight under load.  I am running the
2.4.21-4.1.sct1smp kernel.  Restarting testing again.  Thanks! 
Comment 30 IBM Bug Proxy 2003-11-15 16:07:06 EST
Created attachment 96001 [details]
log.newkernel
Comment 31 IBM Bug Proxy 2003-11-16 20:38:29 EST
------ Additional Comments From kenbo@us.ibm.com  2003-16-11 14:53 -------
I have been trying to reproduce the crash with ext2 and so far I have been 
unsuccessful - meaning that under ext2 I have been unable to crash the OS under 
a Domino load after 2ish days of testing. 
Comment 32 IBM Bug Proxy 2003-11-17 14:09:31 EST
------ Additional Comments From kenbo@us.ibm.com  2003-17-11 14:08 -------
I went to see if there were some updated patches for my RHAS 3.0 system and 
got some weirdness - here is the output:

[root@rhas30 root]# up2date --nosig -u

Fetching package list for channel: rhel-i386-as-3...
########################################

Fetching Obsoletes list for channel: rhel-i386-as-3...
########################################

Fetching rpm headers...
########################################

Name                    Version     Rel          Arch
-----------------------------------------------------
ethereal                0.9.16      0.30E.1      i386
glibc                   2.3.2       95.6         i686
glibc-common            2.3.2       95.6         i386
glibc-devel             2.3.2       95.6         i386
glibc-headers           2.3.2       95.6         i386
glibc-profile           2.3.2       95.6         i386
glibc-utils             2.3.2       95.6         i386
iproute                 2.4.7       11.30E.1     i386
kernel                  2.4.21      4.0.1.EL     i686
nptl-devel              2.3.2       95.6         i686
nscd                    2.3.2       95.6         i386
Comment 33 Stephen Tweedie 2003-11-21 12:09:24 EST
I've still got no great ideas on this one.  Basically, all we know is
that some buffer memory has become corrupt.  A dump could be extremely
valuable here if you can manage to create one.

What sort of hardware is it you are running this workload on (memory
size, disks, CPUs etc)?

Are there _any_ other errors in the logs?
Comment 34 mark wisner 2003-11-24 22:40:54 EST
------ Additional Comments From khoa@us.ibm.com  2003-24-11 22:40 -------
Sachin - is there anything we can do here to help ?  Please take a look at
this bug for me...Thanks. 
Comment 35 mark wisner 2003-11-25 06:05:56 EST
------ Additional Comments From ssant@in.ibm.com  2003-25-11 04:15 -------
Chinmay, can you check if we can recreate the problem here? Please have a look at 
this bug and update with your observations. 
Comment 36 mark wisner 2003-11-25 06:10:44 EST
------ Additional Comments From achinmay@in.ibm.com(prefers email via albal@in.ibm.com)  2003-25-11 06:11 -------
Hi,

Could you please let us know where we could download domino and the test scripts
to try and recreate the problem here.

Regards
- Chinmay 
Comment 37 Stephen Tweedie 2003-11-25 06:27:25 EST
Over the past few days I've been chasing a separate support issue
which looked very similar to this one.  We were able to get two
separate vmcore dumps which showed precisely where the buffer_head
lists were getting corrupted.  Unfortunately, in that case the
footprint looks very much like hardware memory corruption; we're
getting the customer to check that now.

So unfortunately that line of attack looks like it hasn't panned out
for now.  I think the next step needs to be to try again to get a
vmcore dump of the ext3 crash here.  If we can't get a reproducer
internally, can you try once again to get a netdump vmcore yourselves?

Thanks.
Comment 39 mark wisner 2003-12-03 19:23:20 EST
------ Additional Comments From khoa@us.ibm.com  2003-03-12 18:24 -------
I've asked Mike Lepore to send the Lotus Domino team info on where they 
can put their Domino testcase which then can be obtained by India team.
We will use the 8-way x370 server here in Austin as the ftp site for
this Domino testcase. 
Comment 40 mark wisner 2003-12-04 10:26:37 EST
------ Additional Comments From kenbo@us.ibm.com  2003-04-12 10:23 -------
In order to reproduce this bug, you need to do the following on a RHEL 3.0 
system and a Windows driver system:

  1. ftp to the ltc ftp server and download from /kenbo a) linux.tar (~715Mb), 
then cd into /kenbo/windows and download the contents into a separate 
directory on your system (the contents are the 6.5 Notes client for windows 
and testnsf with an example cmd file)

On your RHEL 3.0 system:
  1. unpack linux.tar and use the contents (Pre-Release 6.51 daily non 
production build) to install/setup the Domino server - make sure to have the 
admin.id file saved local to the data directory.
  2. in a window, start up Domino from the data directory
  3. at the server prompt, type the command "sh users" so that the server will 
display number of connected users every minute.
  
On your Windows driver system,
  1. ftp to your RHEL 3.0 machine and bring over the admin.id file.
  2. using the files from /kenbo/windows, run setup to install the notes 
client, using the admin.id file as the user
  3. Once install/setup is complete, copy testnsf.exe over to the executables 
directory (e:/lotus/notes for example, same place nserver.exe is at)
  4. From a command prompt, cd to your notes exec directory and run the 
command as shown in testnsf.cmd or here - "testnsf -user -t 1500 -g 9999 
SERVER" where SERVER is the name of your RHEL 3.0 Domino for Linux server 
(such as rhel30/dev, for example).  This will have testnsf run a "user" script 
which exhibits the behavior of a moderately active user, it will ramp up to 
1500 users over a period of time and it will run for a long time.  If your 
driver machine is not strong enough, you can duplicate this over several 
machines, so for example have 4 machines setup like this and each running 400 
users instead of 1500 on one.
  5. If everything is done correctly, you should see your users connecting to 
the RHEL 3.0 Linux Domino server.

Eventually, the Linux server should crash under load.  If you have setup 
netdump between this RHEL 3.0 machine and another RHEL 3.0 machine, you should 
at least get a log file for the crash.  In the 3 times I've crashed, I've 
never gotten a core file, and I've gotten good log files 2 times - which are 
in this bugzilla.

To reiterate, the system I was running on was a 4-way 2.8Ghz Intel system with 
4GB RAM and SCSI harddrives.  You will need a good amount of disk space for 
this - approximately 2-4MB per user, and after every other run, you may want 
to remove the user mailboxes (they are in the testtmp directory under the data 
directory).


Good luck! 
Comment 46 Bob Johnson 2003-12-05 09:31:50 EST
IBM, we will be on site on Thursday next week to investigate.  Please
have the system and people available.  Stephen is going to be in
Westford for other business and will have some time to work this in real time.
Comment 50 IBM Bug Proxy 2003-12-08 10:27:19 EST
------ Additional Comments From khoa@us.ibm.com  2003-08-12 10:28 -------
Chinmay & Sachin - do you have any update on this ?  Did you get the testcase?
Thanks. 
Comment 53 IBM Bug Proxy 2003-12-09 13:36:23 EST
------ Additional Comments From dmosby@us.ibm.com  2003-09-12 13:34 -------
We need a dump from the system. I spoke with Jim K., our netdump expert here 
in Beaverton. He states that if we are getting the stack portion of the dump 
and seeing the core incomplete message, this indicates that Netdump is set up 
OK. He has no experience with RHEL 3.0. With RH 2.1 he sees 50% success, with 
7.3 10% success. He states that the only thing he can suggest is to keep trying to 
generate a dump and hope for success. Perhaps the RH folks can offer their 
experience with Netdump on RHEL 3.0. The guide to Netdump (besides man/info 
docs on system) is: 
http://www.redhat.com/support/wpapers/redhat/netdump/index.html
But Jim tells us that based on Netdump action observed he believes it is set 
up correctly.
Another action we could try would be to set up an LKCD kernel and hope it does 
not affect the behavior seen. 
Comment 55 Stephen Tweedie 2003-12-09 15:21:54 EST
It's definitely worth trying again to get a vmcore; that's likely to
be a great help in chasing this down.
Comment 57 Dave Anderson 2003-12-09 15:59:20 EST
Re: the netdump-server, what error messages did it log in
the /var/log/messages file on the netdump server?
Comment 58 Bob Johnson 2003-12-09 16:06:01 EST
IBM - what type of NIC cards in these systems ?
Please enumerate here, thanks.
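As a pointer, a couple of standard commands should enumerate them (nothing
RHEL3-specific is assumed here):

    /sbin/lspci | grep -i ethernet      # list PCI Ethernet controllers
    cat /proc/net/dev                   # list network interfaces the kernel knows about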
Comment 59 IBM Bug Proxy 2003-12-09 16:46:17 EST
------ Additional Comments From kenbo@us.ibm.com  2003-09-12 16:46 -------
The system is an Intel Xeon system with an E100 ethernet card and an E1000 
ethernet card which is disabled.  As mentioned earlier, the system has 4 
2.8GHz Xeon MP CPUs, 4GB memory, 2GB swap, and is running on a single 36GB SCSI 
hard drive.

As far as error messages in /var/log/messages, I've only ever seen the one 
message I talked of earlier - "Oct 28 13:17:33 rhas30 kernel: VFS: brelse: 
Trying to free free buffer"

The other crashes did not show anything in the messages file. 
Comment 60 IBM Bug Proxy 2003-12-09 17:11:17 EST
------ Additional Comments From dmosby@us.ibm.com  2003-09-12 17:10 -------
A couple comments (nothing incredibly insightful, but all I can think
of at the moment):
1) Can we reduce memory? Netdump often seems to have problems with large
memory, so 2 gig would be better than 4 gig, 1 gig even better yet.
If we can restrict memory we have a better shot at getting a dump based
on collective wisdom around here.
2) In comment 17 there is the note that the test ramps up very slowly.
Is this something we can change? If the test has control parameters that
let us ramp up stress a bit faster we might have more shots at trying to
get a dump.
3) What about messages on the system that should have received the netdump?
Any clues there? (As to why netdump failed.)

The modules file and your description of hardware didn't look like anything
special as far as hardware, and you say you have the gigabit ether
disabled -- and I know that every bug looks just like the last one you
worked on -- but the last one I worked on had memory corruption caused by
a bug in a 4-port gigabit ether card. So if there are any oddball devices with
drivers that could be suspect, please mention those. 
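
If pulling DIMMs is inconvenient, one way to try the reduced-memory suggestion is
the mem= kernel boot parameter; a sketch of a grub.conf entry (kernel version, root
device and paths here are illustrative, not taken from this system):

    title Red Hat Enterprise Linux AS (2.4.21-4.1.sct1smp, 2GB cap)
            root (hd0,0)
            kernel /vmlinuz-2.4.21-4.1.sct1smp ro root=/dev/sda2 mem=2048M
            initrd /initrd-2.4.21-4.1.sct1smp.img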
Comment 61 IBM Bug Proxy 2003-12-09 17:27:19 EST
Created attachment 96437 [details]
info.tar
Comment 62 IBM Bug Proxy 2003-12-09 17:27:29 EST
------ Additional Comments From kenbo@us.ibm.com  2003-09-12 16:30 -------
 
system info in a tar file

This tar file contains 4 attachments containing information about the system we
are using to test with. 
Comment 63 IBM Bug Proxy 2003-12-10 06:30:46 EST
------ Additional Comments From achinmay@in.ibm.com(prefers email via albal@in.ibm.com)  2003-10-12 06:29 -------
Hi,

I am trying to recreate the problem. No luck yet.

Regards
- Chinmay 
Comment 64 Dave Anderson 2003-12-10 08:36:37 EST
> As far as error messages in /var/log/messages, I've only ever seen
> the one message I talked of earlier - "Oct 28 13:17:33 rhas30
> kernel: VFS: brelse: Trying to free free buffer"
>
> The other crashes did not show anything in the messages file. 

Please post the /var/log/messages file from the *netdump server*
machine.  The netdump server daemon will post error messages as
to why it bailed out when trying to create the vmcore file.
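
A quick way to pull those lines, assuming the default syslog destination:

    # on the netdump-server machine
    grep -i netdump /var/log/messages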
Comment 65 IBM Bug Proxy 2003-12-10 08:41:13 EST
------ Additional Comments From kenbo@us.ibm.com  2003-12-10 08:38 -------
Ahh, ok, I reproduced the bug again.  Got a netdump log file but only a zero 
length vmcore.  Since the filesystem on the server machine only has 2GB free, 
perhaps it is indeed related to the memory size and so I'm going to look into 
disabling the memory down to 2GB.  One thing of note, on the previous runs 
since Tuesday morning I would each time let it run for long hours, then do 
some disk cleanup, then restart.  This time I let it continue to run into a 
low disk space situation.  The filesystem is 36GB, and prior to the run it had 
30GB free; at the crash it had just under 1GB free.  I don't know if this 
relates, but with a 4GB memory and only 2GB swap, I thought it worthy of 
mentioning.  I will attach the log file as usual - then I will restart the 
test again, but this time I will not cleanup the disk (other than to have 
domino do fixups on the databases).  Thanks! 
Comment 66 Stephen Tweedie 2003-12-10 08:50:29 EST
OK, thanks.  Do you have any idea whether the previous crashes might
be associated with low-disk-space conditions too?  Was the filesystem
actually completely full at any point?
Comment 67 IBM Bug Proxy 2003-12-10 09:12:14 EST
Created attachment 96443 [details]
netdump log sct kernel
Comment 68 IBM Bug Proxy 2003-12-10 09:12:23 EST
------ Additional Comments From kenbo@us.ibm.com  2003-12-10 08:42 -------
 
netdump log of crash for 12/10 at 3:38am on sct kernel 
Comment 69 Stephen Tweedie 2003-12-10 09:26:08 EST
Has this problem ever been reproduced on another machine?  I recently
got to the bottom of a problem which looked very similar to this one,
but which turned out to be due to a hardware memory fault.  The vmcore
was able to identify exactly the format of the memory corruption in
that case.
Comment 70 Dave Anderson 2003-12-10 10:33:48 EST
If the filesystem containing /var/crash directory is smaller than 
the memory image size, then you will most definitely get a
vmcore-incomplete file of zero bytes, and the precise behaviour
being seen here.  When that happens, the netdump-server daemon
will syslog a message that states:

  "No space for dump image"

That message will be found in the /var/log/messages file on 
the netdump server box.  That is why we've been asking for
a copy of the messages on the *server* side machine, please! 
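
A quick sanity check before the next run (standard commands, nothing
netdump-specific assumed):

    # on the netdump server: free space where the vmcore will be written
    df -h /var/crash

    # on the crashing client: physical memory -- the vmcore will be roughly this size
    grep MemTotal /proc/meminfo

    # on the netdump server: confirm whether the daemon gave up for lack of space
    grep "No space for dump image" /var/log/messages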

Comment 71 IBM Bug Proxy 2003-12-10 10:46:34 EST
------ Additional Comments From dmosby@us.ibm.com  2003-12-10 10:43 -------
The system set up to receive the netdump must have more free disk space than 
the system generating the dump has free memory. Would be interesting to see 
log message from the system trying to store the dump, but really no use 
continuing with the current box set for netdump. We need to try again with a 
new system to receive the dump. 
Comment 72 IBM Bug Proxy 2003-12-10 11:22:18 EST
------ Additional Comments From kenbo@us.ibm.com  2003-12-10 11:21 -------
This was indeed the problem - my 2nd RHAS 3.0 system did not have enough free 
space for the core image (the log file on the netdump-server machine did have 
messages to this effect).  I hard-mounted a directory from another system with 
25GB free and root on my RHAS 3.0 system has full write permissions - I then 
plan on making /var/crash point to this filesystem.  Does it sound like this 
will work?  Otherwise, I can try and get a hold of an external drive and 
attach it physically to the netdump-server and use that for /var/crash. Thanks! 
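
For what it's worth, a sketch of that arrangement (hostname and export path are
placeholders), assuming the netdump-server daemon simply writes wherever /var/crash
points:

    # on the netdump-server machine: NFS-mount the large filesystem directly over
    # the crash directory; "hard" matches the hard mount described above
    mount -t nfs -o hard,intr bigbox:/export/crashspace /var/crash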
Comment 73 IBM Bug Proxy 2003-12-10 11:31:15 EST
------ Additional Comments From dmosby@us.ibm.com  2003-12-10 11:30 -------
The net mounted file system should be fine. Thanks. 
Comment 74 IBM Bug Proxy 2003-12-10 23:16:34 EST
------ Additional Comments From khoa@us.ibm.com  2003-12-10 23:16 -------
The India team has been able to get the workload running for more than 8
hours but has not encountered any kernel panics yet.  The team has already
set up netdump, so if and when a kernel panic occurs, a dump will be taken.
I have asked the team to upload the dump to a server here in Austin and
notify me - if and when a kernel panic happens.  Right now all we can do
is wait :-) 
Comment 75 IBM Bug Proxy 2003-12-15 10:01:45 EST
------ Additional Comments From achinmay@in.ibm.com(prefers email via albal@in.ibm.com)  2003-12-15 07:44 -------
Hi,

We keep seeing this error on the client  
" * Compose 2 new mail memos/replies (taking 1-2 minutes to write them).
avigate 70 -> 71 Time(ms): 1382
04C0:03EE-1778]: *** (cmd=navigate): Error in NIFOpenNote: 0x030C--No document
o navigate to dB CN=flash/O=IBM!!testtmpmail1004.nsf
04C0:01C2-0C2C]: *** (cmd=navigate): Error in NIFOpenNote: 0x030C--No document
o navigate to dB CN=flash/O=IBM!!testtmpmail448.nsf
04C0:03D6-1AF4]: *** (cmd=navigate): Error in NIFOpenNote: 0x030C--No document
o navigate to dB CN=flash/O=IBM!!testtmpmail980.nsf
04C0:0428-171C]: *** (cmd=navigate): Error in NIFOpenNote: 0x030C--No document
o navigate to dB CN=flash/O=IBM!!testtmpmail1062.nsf
ause 0  msec   * pause (0.00  msec)
navigate 70 -> 71 Time(ms): 1362 "

and a few hundred Domino server threads are always open at the server side. Is
this behaviour expected, or are we doing something wrong with regard to the test
configuration?

Regards
- Chinmay 
Comment 76 IBM Bug Proxy 2003-12-15 10:02:57 EST
------ Additional Comments From kenbo@us.ibm.com  2003-12-15 08:39 -------
Domino 6 is a thread-per-connection model.  Therefore, you should see 
approximately 1 server "thread" per client connection.  If you are seeing less 
than this, then something is wrong somewhere.  If you type the command "sh 
perf" at the server command prompt, it will tell you the trans/min and # users 
connected each minute - this can help you to see if you are a) connecting to 
the server ok, and b) doing usefull work.  For example, I see anywhere from 
4000 to 7000 trans/min when I have my 1500 users connected. 
Comment 78 IBM Bug Proxy 2003-12-18 13:26:36 EST
------ Additional Comments From kenbo@us.ibm.com  2003-12-18 13:21 -------
I'm beginning to wonder if this isn't related to a hardware issue.  The system 
had been in one person's office and the heat was high in there - turns out it 
got to as high as 115 degrees - and that was when I was seeing the crashes.  I 
moved the system into my office to be able to better test/debug and since then 
saw only 1 crash (on day two) and I've now been up under various testing load 
for 8+ days.  My office is showing a high temperature of 93 degrees since I 
started tracking it a few days ago.  So, at this point I can no longer 
duplicate the bug - I will continue to try, but I'm ok with putting this bug 
into a lesser range or even closing it until I can reproduce it.  Thanks! 
Comment 79 Stephen Tweedie 2003-12-18 15:01:20 EST
OK, I'll close it as WORKSFORME for now, but if you do get a crash
dump out of subsequent testing please feel free to send that in so
that I can check for any obvious signs of hardware corruption.
 The only other time I've seen this footprint, there was a clear sign
of bad memory in the vmcore, so it's worth checking.
Comment 80 IBM Bug Proxy 2003-12-24 10:38:38 EST
------ Additional Comments From khoa@us.ibm.com  2003-12-23 23:12 -------
This is great news in the sense that this may indeed be a hardware issue.
Please re-open this bug report if you can reproduce it or have some info
indicating that this is really not a hardware issue.  Thanks. 
