From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624 Netscape/7.1

Description of problem:
This is the VMware memory stress test again. The test is usually very good at exposing Linux kernel bugs. The oops happens very consistently after 1 to 2 days of running the stress test, and it happens on the host (the real machine). It is reproducible on different machines:
2CPU/4GB RAM/2GHz box
2CPU/8GB RAM/700MHz box
The VM is running Red Hat's redhatready for RHEL3.0 at the CORE and MEMORY tests.

kernel BUG at buffer.c:604!
invalid operand: 0000
sg sr_mod ide-cd cdrom vmnet vmmon parport_pc lp parport autofs nfs lockd sunrpc e100 e1000 floppy microcode keybdev mousedev hid +input usb-ohci usbcore ext3
CPU: 2
EIP: 0060:[<c016220e>] Tainted: PF
EFLAGS: 00013206

EIP is at __insert_into_lru_list [kernel] 0x1e (2.4.21-4.ELsmp)
eax: 00000005 ebx: 00000002 ecx: e2c0d2b0 edx: c04d1b10
esi: e2c0d2b0 edi: e2c0d2b0 ebp: 00001000 esp: f08bfe50
ds: 0068 es: 0068 ss: 0068
Process vmware-vmx (pid: 3235, stackpage=f08bf000)
Stack: 00000002 c0162ca6 e2c0d2b0 00000002 e2c0d2b0 00001000 c0162cdc e2c0d2b0
       c0163b55 e2c0d2b0 00000000 13f1b000 00000000 c1745aa8 f5796700 c01643ef
       f5796700 c1745aa8 00000000 00001000 c1745aa8 f5796700 00000000 f57967c4
Call Trace:
[<c0162ca6>] __refile_buffer [kernel] 0x56 (0xf08bfe54)
[<c0162cdc>] refile_buffer [kernel] 0x1c (0xf08bfe68)
[<c0163b55>] __block_commit_write [kernel] 0xb5 (0xf08bfe70)
[<c01643ef>] generic_commit_write [kernel] 0x3f (0xf08bfe8c)
[<f88687b0>] ext3_commit_write [ext3] 0x1c0 (0xf08bfeb0)
[<f8868410>] journal_dirty_sync_data [ext3] 0x0 (0xf08bfecc)
[<c0149065>] do_generic_file_write [kernel] 0x235 (0xf08bfeec)
[<c014956f>] generic_file_write [kernel] 0x13f (0xf08bff40)
[<f8865149>] ext3_file_write [ext3] 0x39 (0xf08bff6c)
[<c0160ada>] sys_pwrite [kernel] 0xca (0xf08bff8c)

Code: 0f 0b 5c 02 c3 91 2b c0 8b 02 85 c0 75 07 89 0a 89 49 28 8b

Kernel panic: Fatal exception

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-4.EL.i686.rpm

How reproducible:
Always

Steps to Reproduce:
1. Allocate a VM with 1.5G.
2. Start the redhat ready test in the VM.
3. After 1-2 days of running, the host crashes.

Actual Results: crashed

Expected Results: no crash

Additional info:
Can this problem be reproduced on an untainted kernel?
The modules that cause the kernel to be tainted are vmmon and vmnet. But these two modules are open source: the source code comes with VMware, so you can get all of it. It is just not marked as GPL yet. Let me know if you have any problems with that.
Let me add some progress from my side, since I am looking at the bug as well. The assert complains that the bh LRU pointers are not NULL when the bh is about to be inserted into the LRU list. The bh has just been taken off the LRU list, which normally resets both LRU pointers to NULL, and all of this is protected by the LRU spin lock. All of this looks sane. There might be another path that sets the bh pointers that I don't know about.
So I started recompiling the Red Hat kernel from the kernel source:
    cp config/*smp*i686* .config
    make mrproper
    make modules
    make install modules_install
That gave me a custom-built kernel from the same source (hopefully the kernel-source rpm contains all the right patches), which I passed to QA to test again. Last week, the first run of the custom kernel passed the test. I am asking QA to do more runs in the hope of reproducing the problem, so I can insert my debug code to verify where the corrupt pointer comes from.
If you or Stephen have some insight into what is going on, or have a patch you want me to try, please let me know.
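A minimal sketch of the invariant described in the comment above, written as a user-space Python toy rather than kernel code. The names `BufferHead`, `LruList`, `next_free`, `prev_free`, `refile` and `LruBug` are illustrative assumptions, not the kernel's identifiers, and the spin lock is left out because the sketch is single-threaded. It only shows why a buffer whose links are not reset between removal and insertion trips the check.

```python
class LruBug(RuntimeError):
    """Raised where the real code would hit its BUG() check (illustrative)."""


class BufferHead:
    """Toy stand-in for a buffer head; only the two LRU links are modeled."""

    def __init__(self, name):
        self.name = name
        self.next_free = None
        self.prev_free = None


class LruList:
    """Circular doubly linked list that insists on clean links at insert time."""

    def __init__(self):
        self.head = None
        self.count = 0

    def insert(self, bh):
        # Invariant: a buffer being inserted must not still be linked anywhere.
        if bh.next_free is not None or bh.prev_free is not None:
            raise LruBug(f"{bh.name}: LRU links are not NULL on insert")
        if self.head is None:
            self.head = bh
            bh.next_free = bh.prev_free = bh
        else:
            tail = self.head.prev_free
            bh.next_free = self.head
            bh.prev_free = tail
            tail.next_free = bh
            self.head.prev_free = bh
        self.count += 1

    def remove(self, bh):
        if bh.next_free is None or bh.prev_free is None:
            raise LruBug(f"{bh.name}: not on a list")
        if bh.next_free is bh:
            # Last element: the list becomes empty.
            self.head = None
        else:
            bh.prev_free.next_free = bh.next_free
            bh.next_free.prev_free = bh.prev_free
            if self.head is bh:
                self.head = bh.next_free
        # Removal resets both links, which is what insert() later relies on.
        bh.next_free = bh.prev_free = None
        self.count -= 1


def refile(bh, src, dst):
    """Move a buffer from one list to another: remove, then insert."""
    src.remove(bh)
    dst.insert(bh)
```

In this toy model a normal `refile` always succeeds, so the check can only fire if something writes to one of the two links after `remove` has cleared them, which matches the suspicion of a path outside the usual locked sequence.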
Thanks for the info, Christopher. Good luck on your debugging. I'm reassigning this to our VMware contact Todd Barr, since my understanding is that Red Hat doesn't support custom-built kernels. If you can later reproduce this problem on a stock Red Hat kernel, please let us know. Cheers. -ernie
You did not read my comment carefully. We have no problem reproducing the bug on the stock Red Hat kernel. I built the custom kernel for debugging and to better understand the issue. We haven't reproduced it on the custom kernel yet. The bug did not go away because I am trying to build a custom kernel. So far the problem does exist on the stock Red Hat kernel.
Does the problem occur on the stock kernels without any non-RH loadable modules? Specifically, does it fail if vmmon and vmnet are not loaded? I did carefully read the above description but was unable to make this distinction.
Sorry I did not make that clear enough. We can't do the test without the vmmon and vmnet modules. The memory load is generated from inside the VMware guest, which is a Red Hat Linux BTW. We can't run VMware without the vmmon module, for which you can get all the source code BTW. The good thing about this test is that the memory load is very real: it is generated by the guest OS rather than by some simple program that allocates and touches memory. It is a very good way to test the Linux kernel as well. In the past, we have found lots of bugs in the kernel or in Red Hat-related patches. Please take a look at Red Hat bug 85275 for example.
The mutual support agreement we have with vmware is that vmware fields issues on their end. Then if they have specific problems, they should be demonstrated on a generic RHEL configuration. vmmon and vmnet are not part of a generic RHEL config.
I agree that vmmon is not part of a generic RHEL config. The recompiled kernel passed the test 3 times in a row. Don't you find it strange that the stock kernel crashes but the recompiled one does not? It might indicate there is something wrong with the stock kernel. Even so, I want to nail down what is going on. Right now the problem points to the binary stock kernel, but I can't debug that.
You are building a kernel with a completely different set of config options. That's why it's different:
    cp config/*smp*i686* .config
    make mrproper
    make modules
    make install modules_install
What that does is wipe the config file entirely and use defconfig, which quite likely will result in an entirely different config.
Oops, I forgot that mrproper blows away the .config. I just did a diff on the .config; the most obvious differences are SMP and HIGHIO. I am restarting the process. Sorry about that.
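The .config comparison mentioned above can be sketched as a small script. This is only an illustration, assuming the usual kernel .config line forms (`CONFIG_X=value` and `# CONFIG_X is not set`); the function names `parse_config` and `diff_configs` are hypothetical, and an option missing from a file is treated the same as one that is not set.

```python
import re

# The two line forms found in a kernel .config file.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value}, mapping 'is not set' lines to 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for every option that differs."""
    old = parse_config(old_text)
    new = parse_config(new_text)
    changes = {}
    for key in sorted(set(old) | set(new)):
        # Assumption: an option absent from a file counts as 'n'.
        before = old.get(key, "n")
        after = new.get(key, "n")
        if before != after:
            changes[key] = (before, after)
    return changes
```

Reading the shipped config and the .config actually used for the build into two strings and passing them to `diff_configs` lists every option that changed, which makes a wiped or defaulted config easy to spot before a multi-day test run is started.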
One question: in what form does VMware store the virtual memory for its guest? Does it use an mmap()d file, anonymous memory, tmpfs, ...? If it uses an mmap()d file, can the bug be reproduced on ext2, or only with ext3?
It mmap()s a file called "ram#". In this case the ram file is on ext3. We can put the ram file on other file systems as well, e.g. /dev/shm. We will try other file systems to narrow down whether the problem is specific to the file system or not. But that will take some time, since it usually takes at least one day to reach the crash. The test is running on the recompiled kernel right now. Since this is a typical customer setup, we would like to find the root cause of the problem. Thanks for the suggestion.
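To make "mmap()s a file" concrete, here is a minimal sketch of file-backed memory using Python's standard `mmap` module. It is not VMware's implementation; the function name `touch_file_backed_memory` and the file name used with it are assumptions for illustration. It creates a file of a given number of pages, maps it, dirties one byte in each page, and checks that the data reached the file.

```python
import mmap

PAGE = mmap.PAGESIZE


def touch_file_backed_memory(path, pages):
    """Map `pages` pages of the file at `path`, dirty each one, and
    return how many pages show up as modified in the file afterwards."""
    size = pages * PAGE
    with open(path, "w+b") as backing:
        # Extend the file to the full mapping size (it reads back as zeros).
        backing.truncate(size)
        with mmap.mmap(backing.fileno(), size) as mem:
            for i in range(pages):
                # Writing one byte is enough to make the whole page dirty.
                mem[i * PAGE] = (i % 255) + 1
            # Ask for the dirty pages to be written back to the file.
            mem.flush()
        backing.seek(0)
        data = backing.read()
    return sum(1 for i in range(pages) if data[i * PAGE] != 0)
```

Pointing `path` at a directory on ext3, ext2 or /dev/shm changes which file system backs the mapping, which is the variable the comment above proposes to test.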
A few updates on this bug:
- The recompiled kernel with the right config file did not reproduce the bug.
- On the stock kernel, mounting the file system as ext2 did not reproduce the bug.
- The Red Hat Update 2 kernel did not reproduce the bug.
It seems only the original stock kernel triggers it.
OK, then I guess vmware was triggering a VM bug in GA that was fixed later. Chris, would it be ok if we closed this bug ?
We will be okay with closing this once our logs from the tests have been approved and we get a posting on the RH's HCL. =) I'm waiting on Rob Landry's reply back on this.
I'm okay with closing this bug out. I've updated my host machine to Update2 and the issue is no longer there. Thanks.
OK.