From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4)
Description of problem:
It is the VMware memory stress test again.
This test is usually very good at exposing Linux kernel bugs.
The crash happens very consistently after 1 to 2 days of running
the stress test.
The kernel oops happens on the host (the real machine).
It is reproducible on different machines:
2CPU/4GB RAM/2GHz box
2CPU/8GB RAM/700MHz box
The VM is running Red Hat's redhatready tests for RHEL 3.0,
specifically the CORE and MEMORY tests.
kernel BUG at buffer.c:604!
invalid operand: 0000
sg sr_mod ide-cd cdrom vmnet vmmon parport_pc lp parport autofs nfs
e100 e1000 floppy microcode keybdev mousedev hid
+input usb-ohci usbcore ext3
EIP: 0060:[<c016220e>] Tainted: PF
EIP is at __insert_into_lru_list [kernel] 0x1e (2.4.21-4.ELsmp)
eax: 00000005 ebx: 00000002 ecx: e2c0d2b0 edx: c04d1b10
esi: e2c0d2b0 edi: e2c0d2b0 ebp: 00001000 esp: f08bfe50
ds: 0068 es: 0068 ss: 0068
Process vmware-vmx (pid: 3235, stackpage=f08bf000)
Stack: 00000002 c0162ca6 e2c0d2b0 00000002 e2c0d2b0 00001000 c0162cdc
c0163b55 e2c0d2b0 00000000 13f1b000 00000000 c1745aa8 f5796700
f5796700 c1745aa8 00000000 00001000 c1745aa8 f5796700 00000000
Call Trace: [<c0162ca6>] __refile_buffer [kernel] 0x56 (0xf08bfe54)
[<c0162cdc>] refile_buffer [kernel] 0x1c (0xf08bfe68)
[<c0163b55>] __block_commit_write [kernel] 0xb5 (0xf08bfe70)
[<c01643ef>] generic_commit_write [kernel] 0x3f (0xf08bfe8c)
[<f88687b0>] ext3_commit_write [ext3] 0x1c0 (0xf08bfeb0)
[<f8868410>] journal_dirty_sync_data [ext3] 0x0 (0xf08bfecc)
[<c0149065>] do_generic_file_write [kernel] 0x235 (0xf08bfeec)
[<c014956f>] generic_file_write [kernel] 0x13f (0xf08bff40)
[<f8865149>] ext3_file_write [ext3] 0x39 (0xf08bff6c)
[<c0160ada>] sys_pwrite [kernel] 0xca (0xf08bff8c)
Code: 0f 0b 5c 02 c3 91 2b c0 8b 02 85 c0 75 07 89 0a 89 49 28 8b
Kernel panic: Fatal exception
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Allocate a VM with 1.5 GB of memory.
2. Start the redhatready test in the VM.
3. After 1-2 days of running, the host crashes.
Actual Results: crashed
Expected Results: not crashed
Can this problem be reproduced on an untainted kernel?
The modules that cause the kernel to be tainted are vmmon and vmnet.
But these two modules are open source. You can get all the source
code of the modules. The source of the modules comes with VMware.
They are just not declared as GPL yet.
Let me know if you have any problems with that.
Let me add some progress from my side. I am looking at the bug
as well. The assert complains that the bh LRU pointers are not NULL
when the bh is about to be inserted into the LRU list. The bh has
just been taken off the LRU list, and normally that resets the two
LRU pointers to NULL. All of this is protected by the LRU spin lock.
All this looks sane. There might be
another path that gets to the bh and sets the bh pointer I don't
So I started recompiling the Red Hat kernel from the kernel source.
cp config/*smp*i686* .config
make install modules_install
Then I got a custom-built kernel from the same source (hopefully the
kernel-source rpm contains all the right patches) and passed it to QA
to test again. Last week, the first try of the custom kernel passed
the test. I am asking QA to do more runs in the hope of reproducing
the problem, so I can insert my debug code to verify where the
corrupt pointer comes from.
If you or Stephen have some insight into what is going on, or have
a patch you want me to try, please let me know.
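
For readers following the analysis above, here is a minimal userspace
sketch of the invariant behind the check at buffer.c:604. It is not
the kernel code: struct buf, lru_insert and lru_remove are
hypothetical names, the LRU spin lock is omitted, and the insert
returns -1 where the kernel would call BUG(). It only shows why a
buffer whose LRU pointers were not reset to NULL on removal (or were
overwritten by some other path) trips the check on the next insert.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head: only the two LRU
 * link fields matter for this sketch. */
struct buf {
    struct buf *next_free;
    struct buf *prev_free;
};

/* Insert b at the tail of the circular list headed by *head.
 * Returns -1 where the kernel would hit BUG(): a non-NULL link means
 * the buffer looks like it is still on some list. */
int lru_insert(struct buf **head, struct buf *b)
{
    if (b->next_free || b->prev_free)
        return -1;
    if (*head == NULL) {
        *head = b;
        b->prev_free = b;          /* single-element circular list */
    }
    b->next_free = *head;
    b->prev_free = (*head)->prev_free;
    (*head)->prev_free->next_free = b;
    (*head)->prev_free = b;
    return 0;
}

/* Unlink b and reset both links to NULL so a later insert is legal. */
void lru_remove(struct buf **head, struct buf *b)
{
    if (b->next_free == NULL)
        return;                    /* not on a list */
    if (*head == b)
        *head = (b->next_free == b) ? NULL : b->next_free;
    b->prev_free->next_free = b->next_free;
    b->next_free->prev_free = b->prev_free;
    b->next_free = NULL;
    b->prev_free = NULL;
}
```

In this sketch a refile is lru_remove() followed by lru_insert().
With correct removal and locking the insert cannot fail, which is why
the oops suggests the pointers were modified somewhere else.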
Thanks for the info, Christopher. Good luck on your debugging.
I'm reassigning this to our VMware contact Todd Barr, since my
understanding is that Red Hat doesn't support custom-built
kernels. If you can later reproduce this problem on a stock
Red Hat kernel, please let us know.
You did not read my comment carefully. We have no problem
reproducing the bug on the stock Red Hat kernel.
I built the custom kernel for debugging and to better
understand the issue. We haven't reproduced it on the custom
kernel yet. The bug does not go away just because I am trying to
build a custom kernel.
So far the problem does exist on the stock Red Hat kernel.
Does the problem occur on the stock kernels without any non-RH
loadable modules? Specifically, does it fail if vmmon and vmnet are
I did carefully read the above description but was unable to make
Sorry I did not make that clear enough.
We can't do the test without the vmmon and vmnet modules.
The memory load is generated from inside the VMware guest,
which is Red Hat Linux, BTW. We can't run VMware without
the vmmon module, which you can get all the source code
The good thing about this test is that the memory load is
very real. The memory load is generated by the guest OS
instead of by some simple program that allocates and touches
memory. It is a very good way to test the Linux kernel
as well. In the past, we have found lots of bugs in the
kernel or in Red Hat related patches. Please take a look at
Red Hat bug 85275 for example.
The mutual support agreement we have with vmware is that vmware fields
issues on their end. Then if they have specific problems, they should
be demonstrated on a generic RHEL configuration. vmmon and vmnet are
not part of a generic RHEL config.
I agree that vmmon is not part of a generic RHEL config.
The recompiled kernel passed the test 3 times in a row.
Don't you find it strange that the stock kernel crashes but
the recompiled one does not? It might indicate there is something
wrong with the stock kernel.
I do want to nail down what is going on. Right now
the problem points to the binary stock kernel. But I can't
You are building a kernel with a completely different set of config
options. That's why it's different:
cp config/*smp*i686* .config
make install modules_install
What that does is wipe the config file entirely and use defconfig, which
quite likely will result in an entirely different config.
Oops, I forgot that mrproper blows away the .config.
I just made a diff of the .config files; the most obvious differences
are SMP and HIGHIO. I am restarting the process again.
Sorry for that.
One question, in what form does vmware store the virtual memory for
its guest ?
Does it use an mmap()d file, anonymous memory, tmpfs, ... ?
If it uses an mmap()d file, can the bug be reproduced on ext2 or only
with ext3 ?
It mmap()s a file called "ram#". In this case the ram file is on ext3.
We can put the ram file on other file systems as well, e.g. /dev/shm.
We will try other file systems to narrow down whether the problem is
specific to the file system or not.
But that will take some time. It usually takes at least one day to
reach the crash. It is running the recompiled kernel right now.
Since this is a typical customer setup, we would like to find
the root cause of the problem.
Thanks for the suggestion.
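
As an illustration of what "mmap() a file" means for the guest RAM
backing discussed above, here is a small sketch. It is not VMware's
code: map_ram_file and the paths in the usage note are hypothetical.
It only shows that the same call works whether the file lives on
ext3, ext2 or /dev/shm, which is what makes the file system easy to
swap when narrowing the problem down.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map len bytes of the file at path as shared, writable memory,
 * creating and sizing the file if needed. Returns NULL on failure. */
void *map_ram_file(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

For example, map_ram_file("/some/ext3/dir/ram0", len) and
map_ram_file("/dev/shm/ram0", len) differ only in which file system
backs the pages (both paths are made up for this note).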
A few updates on this bug.
- the recompiled kernel with the right config file did not reproduce
- With the stock kernel, mounting the file system as ext2 did not
reproduce the bug.
- Using the Red Hat Update 2 kernel did not reproduce the bug.
It seems only the original stock kernel triggers it.
OK, then I guess vmware was triggering a VM bug in GA that was fixed
Chris, would it be ok if we closed this bug ?
We will be okay with closing this once our logs from the tests have
been approved and we get a posting on the RH's HCL. =)
I'm waiting on Rob Landry's reply back on this.
I'm okay with closing this bug out. I've updated my host machine to
Update2 and the issue is no longer there. Thanks.