Bug 56919
Summary: | aic7xxx_mod.o causes kernel crash on high load.
---|---
Product: | [Retired] Red Hat Linux
Reporter: | Shinya Narahara <naraha_s>
Component: | kernel
Assignee: | Arjan van de Ven <arjanv>
Status: | CLOSED CURRENTRELEASE
QA Contact: | Brock Organ <borgan>
Severity: | high
Priority: | medium
Version: | 7.3
Hardware: | ia64
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2003-06-07 23:03:00 UTC
Description
Shinya Narahara
2001-11-30 12:58:22 UTC
We have found the aic7xxx_mod driver to be not as reliable as the aic7xxx.o driver and therefore made the aic7xxx.o driver the default. The problem cannot be swiotlb, because older kernels had 32 thousand entries (as opposed to the kernel default of 1024) and the 2.4.9-17.3 kernel doesn't have swiotlb at all anymore. Is it possible to get the important part of the oops text? I will look at the driver with that to see if it is an obvious bug.

Unfortunately, we couldn't get the oops text, because it includes many CPU register values on an 8-CPU machine (and the main message had already scrolled off the screen). The aic7xxx.o driver (kernel 2.4.9-12) couldn't pass our heatrun test using bonnie++ either. We must run ia64 with kernel-2.4.3-12 plus some known patches, so we can't try ext3 yet...

2.4.9-17.3 would be interesting since that no longer has the swiommu. Also, maybe it is possible to connect a serial console to capture the oops message on another computer?

We'll try to capture the oops message by connecting a serial terminal later. However, the serial ports (ttyS0, ttyS1) are used by modems now...

Created attachment 39388 [details]
kernel log of kernel-2.4.9-12smp + aic7xxx_mod.o
We've confirmed that RH7.2 for Itanium (and its kernel) can't clear our heatrun test with kernel-smp-2.4.9-18 and kernel-smp-2.4.9-19 (from rawhide), using either the aic7xxx.o or the aic7xxx_mod driver. Please see our easy and silly heatrun test script below. We have no idea why the Red Hat kernel can't run this test for even 48 hours. This script needs bonnie++ and /dev/sdb, but it's easy to customize for your environment.

#!/bin/sh
setterm -blank 0

# Continuous raw reads from each partition, in the background.
for dev in /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4 ; do
    while : ; do
        nice -n 19 dd if=$dev of=/dev/null
    done > /dev/null 2>&1 &
done

# Eight concurrent bonnie++ loops, one directory each.
for i in 1 2 3 4 5 6 7 8; do
    mkdir -p /usr/src/bonnie/$i
    while : ; do
        nice -n 19 /root/bonnie++-1.02a/bonnie++ -u root -d /usr/src/bonnie/$i
    done > /dev/null 2>&1 &
done

# Repeated parallel kernel builds.
pushd /usr/src/linux-2.4/
while : ; do
    make clean
    make -j 8 vmlinux modules
done > /dev/null 2>&1 &
popd

# Heartbeat: print kernel version and time every 10 minutes.
while : ; do
    echo `uname -r` `date`
    sleep 600
done

We've tested this issue with kernel-2.4.9-20. The smp kernel (8 CPUs) couldn't clear our silly test above, but the up kernel could, with both aic7xxx.o and aic7xxx_mod.o. Any comments or requirements for our test?

Created attachment 43656 [details]
kernel log of kernel-2.4.3-12smp + aic7xxx_mod.o with our heatrun test, again.
Based on our heatrun test and the oops message in attachment 43656, we attempted to comment out the BUG() macro in slab.c, and it seemed to be better, clearing our test for 3 days. May we comment it out? Are there any side effects of doing so?

That BUG() triggers if the kernel fails to notify all cpu's of something. This is either a bug in that code, or a motherboard/chipset bug. Interesting.... I've never seen this one before

We believe we have found a solution for this issue. It must be a kernel memory problem, a double kfree(). After patching fs/partitions/efi.c, we don't have this issue anymore. This patch has a large effect because the problem it fixes is so fundamental. We strongly recommend applying it to your kernel...

--- linux-2.4.9-21/fs/partitions/efi.c.orig Fri Mar 1 16:59:19 2002
+++ linux-2.4.9-21/fs/partitions/efi.c Mon Mar 11 16:34:29 2002
@@ -546,8 +547,8 @@
 		*gpt = pgpt;
 		*ptes = pptes;
-		if (agpt) kfree(agpt);
-		if (aptes) kfree(aptes);
+		if (agpt) { kfree(agpt); agpt=NULL; }
+		if (aptes) { kfree(aptes); aptes=NULL; }
 	} /* if primary is valid */
 	else {
 		/* Primary GPT is bad, check the Alternate GPT */
@@ -595,6 +596,8 @@
 	if (agpt) {kfree(agpt); agpt = NULL;}
 	if (pptes) {kfree(pptes); pptes = NULL;}
 	if (aptes) {kfree(aptes); aptes = NULL;}
+	*gpt = NULL;
+	*ptes = NULL;
 	return 0;
 }

scheduled for the next erratum; it indeed looks like a serious bug
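
For readers unfamiliar with the failure mode, the double-free hazard that the efi.c patch addresses can be illustrated with a small user-space sketch. This is not the kernel code: `free()` stands in for `kfree()`, and `struct demo_bufs`, `demo_alloc()` and `demo_release()` are hypothetical names invented for this example. The point is only that clearing a pointer right after releasing it makes a later cleanup pass over the same pointer harmless, whereas a stale pointer would be freed a second time.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the alternate GPT header and entry buffers. */
struct demo_bufs {
    char *agpt;
    char *aptes;
};

/* Allocate both buffers; returns 0 on success, -1 on failure. */
int demo_alloc(struct demo_bufs *b)
{
    b->agpt = malloc(64);
    b->aptes = malloc(64);
    if (b->agpt == NULL || b->aptes == NULL) {
        free(b->agpt);   /* free(NULL) is a no-op in standard C */
        free(b->aptes);
        b->agpt = NULL;
        b->aptes = NULL;
        return -1;
    }
    return 0;
}

/*
 * Release the buffer *p points to and clear the pointer, mirroring the
 * "kfree(x); x = NULL;" pattern in the patch. Returns 1 if memory was
 * actually freed, 0 if the pointer had already been cleared.
 */
int demo_release(char **p)
{
    assert(p != NULL);
    if (*p == NULL)
        return 0;        /* already released: nothing to free twice */
    free(*p);
    *p = NULL;           /* without this, a second call would double-free */
    return 1;
}
```

Calling `demo_release(&b.agpt)` once on the early path and again in a common cleanup path is then safe: the second call sees NULL and does nothing.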