Bug 121728
| Summary: | kernel BUG at buffer.c:604 on memory stress test | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Christopher Li <chrisl> |
| Component: | kernel | Assignee: | Rik van Riel <riel> |
| Status: | CLOSED WORKSFORME | QA Contact: | Brian Brock <bbrock> |
| Severity: | high | Priority: | medium |
| Version: | 3.0 | CC: | petrides, riel, schou, sct, tburke |
| Hardware: | i686 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2004-06-14 20:40:36 UTC |
Description
Christopher Li
2004-04-26 19:38:03 UTC
Can this problem be reproduced on an untainted kernel?

The modules that cause the kernel to be tainted are vmmon and vmnet, but both modules are open source. The source ships with VMware, so you can get all of it; it just isn't marked as GPL yet. Let me know if you have any problems with that.

Let me add some progress from my side, since I am looking at the bug as well. The assertion complains that the bh LRU pointers are not NULL when the bh is about to be inserted into the LRU list. The bh has just been taken off the LRU list, and normally that resets the two LRU pointers to NULL. All of this is protected by the LRU spin lock, and it all looks sane. There may be some other path I don't know about that gets to the bh and sets its pointers.

So I started recompiling the Red Hat kernel from the kernel source:

    cp config/*smp*i686* .config
    make mrproper
    make modules
    make install modules_install

That gives me a custom-built kernel from the same source (hopefully the kernel-source rpm contains all the right patches), which I passed to QA to test again. Last week, the first run of the custom kernel passed the test. I am asking QA to do more runs in the hope of reproducing the problem, so that I can insert my debug code and verify where the corrupt pointer comes from. If you or Stephen have some insight into what is going on, or have a patch you want me to try, please let me know.

Thanks for the info, Christopher. Good luck with your debugging. I'm reassigning this to our VMware contact, Todd Barr, since my understanding is that Red Hat doesn't support custom-built kernels. If you can later reproduce this problem on a stock Red Hat kernel, please let us know. Cheers. -ernie

You did not read my comment carefully. We have no problem reproducing the bug on the stock Red Hat kernel. I built the custom kernel for debugging and to better understand the issue. We haven't reproduced it on the custom kernel yet. The bug did not go away because I am trying to build a custom kernel.
So far, the problem does exist on the stock Red Hat kernel.

Does the problem occur on the stock kernels without any non-RH loadable modules? Specifically, does it fail if vmmon and vmnet are not loaded? I did carefully read the above description but was unable to make this distinction.

Sorry I did not make that clear enough. We can't do the test without the vmmon and vmnet modules. The memory load is generated from inside the VMware guest, which is a Red Hat Linux, by the way. We can't run VMware without the vmmon module, for which you can get all the source code. The good thing about this test is that the memory load is very real: it is generated by the guest OS rather than by some simple program that allocates and touches memory. It is a very good way to test the Linux kernel as well. In the past, we have found lots of bugs in the kernel or in Red Hat related patches. Please take a look at Red Hat bug 85275 for an example.

The mutual support agreement we have with VMware is that VMware fields issues on their end. Then, if they have specific problems, those should be demonstrated on a generic RHEL configuration. vmmon and vmnet are not part of a generic RHEL config.

I agree that vmmon is not part of a generic RHEL config. The recompiled kernel passed the test 3 times in a row. Don't you find it strange that the stock kernel crashes but the recompiled one does not? It might indicate that there is something wrong with the stock kernel. I want to nail down what is going on, but right now the problem points to the binary stock kernel, and I can't debug that.

You are building a kernel with a completely different set of config options. That's why it's different:

    cp config/*smp*i686* .config
    make mrproper
    make modules
    make install modules_install

What that does is wipe the config file entirely and use defconfig, which quite likely will result in an entirely different config.

Oops, I forgot that mrproper blows away the .config.
I just made a diff of the .config files; the most obvious differences are SMP and HIGHIO. I am restarting the process again. Sorry for that.

One question: in what form does VMware store the virtual memory for its guest? Does it use an mmap()ed file, anonymous memory, tmpfs, ...? If it uses an mmap()ed file, can the bug be reproduced on ext2, or only with ext3?

It mmap()s a file called "ram#". In this case the ram file is on ext3. We can put the ram file on other file systems as well, e.g. /dev/shm. We will try other file systems to find out whether the problem is specific to the file system or not, but that will take some time. It usually takes at least one day to reach the crash, and the machine is running the recompiled kernel right now. Since this is a typical customer setup, we would like to find the root cause of the problem. Thanks for the suggestion.

A few updates on this bug:

- The recompiled kernel with the right config file did not reproduce the bug.
- On the stock kernel, mounting the file system as ext2 did not reproduce the bug.
- Using the Red Hat Update 2 kernel did not reproduce the bug.

It seems that only the original stock kernel triggers it.

OK, then I guess VMware was triggering a VM bug in GA that was fixed later. Chris, would it be OK if we closed this bug?

We will be okay with closing this once our logs from the tests have been approved and we get a posting on RH's HCL. =) I'm waiting on Rob Landry's reply on this.

I'm okay with closing this bug out. I've updated my host machine to Update 2 and the issue is no longer there. Thanks.

OK.
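
For readers following the analysis in the first comment, the invariant behind the assertion can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code in buffer.c: `struct buf`, `struct lru_list`, `lru_insert` and `lru_remove` are hypothetical names, the list is a plain circular doubly linked list, and the spin lock that protects the real list is left out because the sketch is single-threaded. The point it shows is that removal must reset both LRU pointers to NULL, so that a later insert can detect a buffer that is still linked (or whose pointers were overwritten by some other path).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer head; NOT the kernel's struct. */
struct buf {
    struct buf *next_lru;   /* NULL while the buffer is off the list */
    struct buf *prev_lru;   /* NULL while the buffer is off the list */
    int id;
};

/* Hypothetical circular LRU list; head is NULL when the list is empty. */
struct lru_list {
    struct buf *head;
    int count;
};

/*
 * Insert at the tail. Returns 0 on success, or -1 if either LRU pointer
 * is already set. That is the condition the assertion in the report is
 * described as checking; a kernel would stop here instead of returning.
 */
int lru_insert(struct lru_list *l, struct buf *b)
{
    if (b->next_lru != NULL || b->prev_lru != NULL)
        return -1;

    if (l->head == NULL) {
        l->head = b;
        b->next_lru = b;
        b->prev_lru = b;
    } else {
        b->next_lru = l->head;
        b->prev_lru = l->head->prev_lru;
        l->head->prev_lru->next_lru = b;
        l->head->prev_lru = b;
    }
    l->count++;
    return 0;
}

/* Unlink the buffer and reset both pointers so it can be inserted again. */
void lru_remove(struct lru_list *l, struct buf *b)
{
    /* The buffer must currently be on the list. */
    assert(b->next_lru != NULL && b->prev_lru != NULL);

    b->prev_lru->next_lru = b->next_lru;
    b->next_lru->prev_lru = b->prev_lru;
    if (l->head == b)
        l->head = (b->next_lru == b) ? NULL : b->next_lru;

    b->next_lru = NULL;
    b->prev_lru = NULL;
    l->count--;
}
```

With this shape, inserting a buffer twice without removing it in between returns -1, while a remove followed by an insert succeeds. If the check fires on a buffer that was just removed, something outside these two functions wrote to the pointers, which is the "other path" the comment speculates about.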