Bug 121728

Summary: kernel BUG at buffer.c:604 on memory stress test
Product: Red Hat Enterprise Linux 3 Reporter: Christopher Li <chrisl>
Component: kernel Assignee: Rik van Riel <riel>
Status: CLOSED WORKSFORME QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 3.0 CC: petrides, riel, schou, sct, tburke
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-06-14 20:40:36 UTC Type: ---

Description Christopher Li 2004-04-26 19:38:03 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4)
Gecko/20030624 Netscape/7.1

Description of problem:
This is the VMware memory stress test again.
This test is usually very good at exposing Linux kernel bugs.
The crash happens very consistently after 1 to 2 days of running
the stress test.

The kernel oops happens on the host (the real machine).
It is reproducible on different machines:
2CPU/4GB RAM/2GHz box
2CPU/8GB RAM/700MHz box

The VM is running Red Hat's redhatready tests for RHEL 3.0
(the CORE and MEMORY tests).


kernel BUG at buffer.c:604!
invalid operand: 0000
sg sr_mod ide-cd cdrom vmnet vmmon parport_pc lp parport autofs nfs
lockd sunrpc e100 e1000 floppy microcode keybdev mousedev hid input
usb-ohci usbcore ext3
CPU:    2
EIP:    0060:[<c016220e>]    Tainted: PF
EFLAGS: 00013206

EIP is at __insert_into_lru_list [kernel] 0x1e (2.4.21-4.ELsmp)
eax: 00000005   ebx: 00000002   ecx: e2c0d2b0   edx: c04d1b10
esi: e2c0d2b0   edi: e2c0d2b0   ebp: 00001000   esp: f08bfe50
ds: 0068   es: 0068   ss: 0068
Process vmware-vmx (pid: 3235, stackpage=f08bf000)
Stack: 00000002 c0162ca6 e2c0d2b0 00000002 e2c0d2b0 00001000 c0162cdc e2c0d2b0
       c0163b55 e2c0d2b0 00000000 13f1b000 00000000 c1745aa8 f5796700 c01643ef
       f5796700 c1745aa8 00000000 00001000 c1745aa8 f5796700 00000000 f57967c4
Call Trace:   [<c0162ca6>] __refile_buffer [kernel] 0x56 (0xf08bfe54)
[<c0162cdc>] refile_buffer [kernel] 0x1c (0xf08bfe68)
[<c0163b55>] __block_commit_write [kernel] 0xb5 (0xf08bfe70)
[<c01643ef>] generic_commit_write [kernel] 0x3f (0xf08bfe8c)
[<f88687b0>] ext3_commit_write [ext3] 0x1c0 (0xf08bfeb0)
[<f8868410>] journal_dirty_sync_data [ext3] 0x0 (0xf08bfecc)
[<c0149065>] do_generic_file_write [kernel] 0x235 (0xf08bfeec)
[<c014956f>] generic_file_write [kernel] 0x13f (0xf08bff40)
[<f8865149>] ext3_file_write [ext3] 0x39 (0xf08bff6c)
[<c0160ada>] sys_pwrite [kernel] 0xca (0xf08bff8c)

Code: 0f 0b 5c 02 c3 91 2b c0 8b 02 85 c0 75 07 89 0a 89 49 28 8b

Kernel panic: Fatal exception



Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-4.EL.i686.rpm

How reproducible:
Always

Steps to Reproduce:
1. Allocate a VM with 1.5G.
2. Start the redhatready test in the VM.
3. After 1-2 days of running, the host crashes.
    

Actual Results:  crashed

Expected Results:  not crashed

Additional info:

Comment 1 Ernie Petrides 2004-05-03 20:11:24 UTC
Can this problem be reproduced on an untainted kernel?


Comment 2 Christopher Li 2004-05-03 21:08:20 UTC
The modules that cause the kernel to be tainted are vmmon and vmnet.
But these two modules are open source. You can get all the source
code of the modules; it ships with VMware.
They are just not marked as GPL yet.

Let me know if you have any problems with that.
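
For reference, the "Tainted: PF" string in the oops can be related to
the kernel's taint value. The following is a minimal sketch, assuming
the two low bits of /proc/sys/kernel/tainted map to P (a module without
a GPL-compatible license was loaded) and F (a module was force-loaded).
decode_taint is a hypothetical helper name, not an existing tool, and
it deliberately ignores any other taint bits.

```shell
# Decode the two taint flags that appear in this report's oops ("PF").
# bit 0 (value 1): module without a GPL-compatible license loaded -> P
# bit 1 (value 2): module was force-loaded                         -> F
# Any other bits are ignored by this sketch.
decode_taint() {
    taint_value=$1
    taint_flags=""
    if [ $(( taint_value & 1 )) -ne 0 ]; then
        taint_flags="${taint_flags}P"
    fi
    if [ $(( taint_value & 2 )) -ne 0 ]; then
        taint_flags="${taint_flags}F"
    fi
    if [ -z "$taint_flags" ]; then
        echo "none"
    else
        echo "$taint_flags"
    fi
}

# On a live host the value would come from the kernel itself
# (assumes the running kernel exposes this sysctl file).
if [ -r /proc/sys/kernel/tainted ]; then
    decode_taint "$(cat /proc/sys/kernel/tainted)"
fi
```

A result of "none" after unloading vmmon and vmnet and rebooting would
correspond to the untainted configuration asked about in comment 1.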

Let me add some progress from my side. I am looking at the bug
as well. The assertion complains that the bh LRU pointers are not
NULL when the bh is about to be inserted into the LRU list. The bh
has just been taken off the LRU list, which normally resets the two
LRU pointers to NULL, and all of this is protected by the LRU spin
lock. All of this looks sane. There might be some other path that I
don't know about which gets to the bh and sets its pointers.

So I started recompiling the Red Hat kernel from the kernel source:
cp config/*smp*i686* .config
make mrproper
make modules
make install modules_install

That gave me a custom-built kernel from the same source (hopefully
the kernel-source rpm contains all the right patches), and I passed
it to QA to test again. Last week, the first try of the custom
kernel passed the test. I am asking QA to do more runs of it in the
hope of reproducing the problem, so that I can insert my debug code
to verify where the corrupt pointer comes from.

If you or Stephen have some insight into what is going on, or have
a patch you want me to try, please let me know.


Comment 3 Ernie Petrides 2004-05-03 21:43:01 UTC
Thanks for the info, Christopher.  Good luck on your debugging.

I'm reassigning this to our VMware contact Todd Barr, since my
understanding is that Red Hat doesn't support custom-built
kernels.  If you can later reproduce this problem on a stock
Red Hat kernel, please let us know.

Cheers.  -ernie


Comment 4 Christopher Li 2004-05-03 23:04:36 UTC
You did not read my comment carefully. We have no problem
reproducing the bug on the stock Red Hat kernel.

I built the custom kernel for debugging and to better
understand the issue. We haven't reproduced it on the custom
kernel yet. The bug did not go away just because I am trying to
build a custom kernel.

So far the problem does exist on the stock Red Hat kernel.



Comment 5 Tim Burke 2004-05-03 23:47:54 UTC
Does the problem occur on the stock kernels without any non-RH
loadable modules?  Specifically, does it fail if vmmon and vmnet are
not loaded?
  I did carefully read the above description but was unable to make
this distinction.


Comment 6 Christopher Li 2004-05-04 00:16:02 UTC
Sorry, I did not make that clear enough.

We can't do the test without the vmmon and vmnet modules.

The memory load is generated from inside the VMware guest,
which is a Red Hat Linux BTW. We can't run VMware without
the vmmon module, for which you can get all the source code
BTW.

The good thing about this test is that the memory load is
very real. The memory load is generated by the guest OS
instead of some simple program that tries to allocate and touch
memory. It is a very good way to test the Linux kernel
as well. In the past, we have found lots of bugs in the
kernel or in Red Hat related patches. Please take a look at
Red Hat bug 85275 for example.


Comment 7 Tim Burke 2004-05-04 00:30:31 UTC
The mutual support agreement we have with vmware is that vmware fields
issues on their end.  Then if they have specific problems, they should
be demonstrated on a generic RHEL configuration.  vmmon and vmnet are
not part of a generic RHEL config.


Comment 8 Christopher Li 2004-05-05 23:16:02 UTC
I agree that vmmon is not part of a generic RHEL config.
The recompiled kernel passed the test 3 times in a row.
Don't you find it strange that the stock kernel crashes but
the recompiled one does not? It might indicate that there is
something wrong with the stock kernel.

Even if I want to nail down what is going on, right now
the problem points to the binary stock kernel, and I can't
debug that.



Comment 9 Tim Burke 2004-05-06 13:26:34 UTC
You are building a kernel with a completely different set of config
options.  That's why it's different:

cp config/*smp*i686* .config
make mrproper
make modules
make install modules_install

What that does is wipe the config file entirely and use defconfig, which
quite likely will result in an entirely different config.

Comment 10 Christopher Li 2004-05-06 18:08:34 UTC
Oops, I forgot that mrproper blows away the .config.
I just made a diff of the .config files; the most obvious differences
are SMP and HIGHIO. I am restarting the process.

Sorry for that.
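
A config mismatch like this can be caught before a multi-day test run
by comparing the stock config with the one actually used for the
build. The following is a minimal sketch, assuming both files are at
hand. config_diff is a hypothetical helper, and the sample file
contents are illustrative only, not the real RHEL 3 configs.

```shell
# Print the CONFIG_ lines that differ between two kernel config files.
# Returns 0 when the option sets match, non-zero otherwise.
config_diff() {
    cfg_a=$(mktemp)
    cfg_b=$(mktemp)
    # Keep both "CONFIG_X=y" and "# CONFIG_X is not set" lines, sorted
    grep 'CONFIG_' "$1" | sort > "$cfg_a"
    grep 'CONFIG_' "$2" | sort > "$cfg_b"
    diff "$cfg_a" "$cfg_b"
    cfg_status=$?
    rm -f "$cfg_a" "$cfg_b"
    return "$cfg_status"
}

# Illustrative sample files (made-up contents)
stock_cfg=$(mktemp)
rebuilt_cfg=$(mktemp)
printf 'CONFIG_SMP=y\nCONFIG_HIGHIO=y\nCONFIG_EXT3_FS=m\n' > "$stock_cfg"
printf '# CONFIG_SMP is not set\n# CONFIG_HIGHIO is not set\nCONFIG_EXT3_FS=m\n' > "$rebuilt_cfg"

config_diff "$stock_cfg" "$rebuilt_cfg" || echo "configs differ"
rm -f "$stock_cfg" "$rebuilt_cfg"
```

Running make mrproper first and copying the config into place
afterwards, rather than the other way around, should keep the intended
.config from being removed.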


Comment 11 Rik van Riel 2004-05-07 20:14:10 UTC
One question, in what form does vmware store the virtual memory for
its guest ?

Does it use an mmap()d file, anonymous memory, tmpfs, ... ?

If it uses an mmap()d file, can the bug be reproduced on ext2 or only
with ext3 ?

Comment 12 Christopher Li 2004-05-07 20:35:24 UTC
It mmap()s a file called "ram#". In this case the ram file is on ext3.
We can put the ram file on other file systems as well, e.g. /dev/shm.
We will try other file systems to determine whether the problem is
specific to the file system or not.

But that will take some time. It usually takes at least one day to
reach the crash. It is running the recompiled kernel right now.

Since this is a typical customer setup, we would like to find
the root cause of the problem.

Thanks for the suggestion.


Comment 13 Christopher Li 2004-05-28 00:37:28 UTC
A few updates on this bug:
- The recompiled kernel with the right config file did not reproduce
the bug.
- On the stock kernel, mounting the file system as ext2 did not
reproduce the bug.
- Using the Red Hat Update 2 kernel did not reproduce the bug.

It seems only the original stock kernel triggers it.

Comment 14 Rik van Riel 2004-05-28 01:41:22 UTC
OK, then I guess vmware was triggering a VM bug in GA that was fixed
later.

Chris, would it be ok if we closed this bug ?

Comment 15 Shirley Chou 2004-05-28 01:50:06 UTC
We will be okay with closing this once our logs from the tests have 
been approved and we get a posting on the RH's HCL. =)
I'm waiting on Rob Landry's reply back on this.

Comment 16 Shirley Chou 2004-06-14 20:24:25 UTC
I'm okay with closing this bug out. I've updated my host machine to
Update2 and the issue is no longer there. Thanks.

Comment 17 Rik van Riel 2004-06-14 20:40:36 UTC
OK.