Bug 56919

Summary: aic7xxx_mod.o causes kernel crash on high load.
Product: [Retired] Red Hat Linux
Reporter: Shinya Narahara <naraha_s>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED CURRENTRELEASE
QA Contact: Brock Organ <borgan>
Severity: high
Docs Contact:
Priority: medium    
Version: 7.3   
Target Milestone: ---   
Target Release: ---   
Hardware: ia64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2003-06-07 23:03:00 UTC
Attachments:
attachment 39388: kernel log of kernel-2.4.9-12smp + aic7xxx_mod.o (flags: none)
attachment 43656: kernel log of kernel-2.4.3-12smp + aic7xxx_mod.o with our heatrun test, again. (flags: none)

Description Shinya Narahara 2001-11-30 12:58:22 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.75 [ja] (WinNT; U)

Description of problem:
During a disk benchmark test on an ia64 machine, a kernel oops appeared.
It occurred with aic7xxx_mod.o, but not with aic7xxx.o.
Kernel versions were 2.4.9-12, 2.4.9-13, 2.4.9-17.3.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Prepare an ia64 machine with / on /dev/sda (on an aha29160).
2. Run eight disk benchmark tests concurrently, e.g. with bonnie++.


Actual Results:  Kernel Oops.

Expected Results:  No oops occurs.

Additional info:

Our test machine is a HITACHI CF-1 with 8 GB of memory, an aha29160 + 2 SCSI disks, an i82559, and an ATI RageXL.
lsmod shows that the drivers in use are aic7xxx_mod.o and e100.o (X is not started).
The aic7xxx_mod driver also needs the parameter "aic7xxx=tag_info:{{3,3,3,3,....}}" to be set,
because the default tag_info value is too large (Red Hat sets it to 253) for the benchmark test
and might cause a kernel oops because of a lack of swiotlb.

On an Intel Lion with components similar to the CF-1 above, we also encountered this issue.
So we guess this is a general issue on ia64...

Comment 1 Arjan van de Ven 2001-11-30 13:02:11 UTC
We have found the aic7xxx_mod driver to be not as reliable as the aic7xxx.o
driver and therefore made the aic7xxx.o driver the default.

The problem cannot be swiotlb because older kernels had 32 thousand entries (as
opposed to the kernel default of 1024) and the 2.4.9-17.3 kernel doesn't even
have swiotlb anymore at all.

Is it possible to get the important part of the oops text? I will look at the
driver with that to see if it is an obvious bug.

Comment 2 Shinya Narahara 2001-11-30 13:21:00 UTC
Unfortunately, we couldn't get the oops text, because it includes
many CPU register values on an 8-CPU machine (and the main message
had already scrolled off the screen).

The aic7xxx.o driver (kernel 2.4.9-12) couldn't pass our heatrun test with
bonnie++ either. We must run ia64 with kernel-2.4.3-12 + some known
patches, so we can't try ext3 yet...


Comment 3 Arjan van de Ven 2001-11-30 13:31:35 UTC
2.4.9-17.3 would be interesting since that no longer has the swiommu..
Also, maybe it is possible to connect serial console to capture the oops message
on another computer ?

Comment 4 Shinya Narahara 2001-11-30 14:14:05 UTC
We'll try to capture the oops message by connecting a serial terminal later.
However, the serial ports (ttyS0, ttyS1) are used by modems now...


Comment 5 Shinya Narahara 2001-12-03 06:37:37 UTC
Created attachment 39388 [details]
kernel log of kernel-2.4.9-12smp + aic7xxx_mod.o

Comment 6 Shinya Narahara 2002-01-11 02:55:56 UTC
We've confirmed that RH7.2 for Itanium (and its kernel) can't pass our heatrun test
with kernel-smp-2.4.9-18 and kernel-smp-2.4.9-19 (from rawhide), using either the aic7xxx.o
or the aic7xxx_mod driver.

Please see our easy and silly heatrun test script. We have no idea why the Red Hat
kernel can't survive this test for even 48 hours.
This script needs bonnie++ and /dev/sdb, but it's easy to customize for your
environment.

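#!/bin/sh
# Heatrun test: concurrent raw reads, bonnie++ runs, and kernel builds.

# Disable console blanking so messages stay visible.
setterm -blank 0

# Read each /dev/sdb partition in an endless background loop.
for dev in /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4 ; do
  while : ; do
    nice -n 19 dd if="$dev" of=/dev/null
  done > /dev/null 2>&1 &
done

# Run eight bonnie++ instances, each in its own directory.
for i in 1 2 3 4 5 6 7 8; do
  mkdir -p "/usr/src/bonnie/$i"
  while : ; do
    nice -n 19 /root/bonnie++-1.02a/bonnie++ -u root -d "/usr/src/bonnie/$i"
  done > /dev/null 2>&1 &
done

# Rebuild the kernel tree in an endless background loop.
# A subshell with cd replaces pushd/popd, which a plain sh may lack.
(
  cd /usr/src/linux-2.4/ || exit 1
  while : ; do
    make clean
    make -j 8 vmlinux modules
  done > /dev/null 2>&1
) &

# Print the kernel version and a timestamp every 10 minutes.
while : ; do
  echo "$(uname -r)  $(date)"
  sleep 600
done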


Comment 7 Shinya Narahara 2002-01-21 05:14:53 UTC
We've tested this issue with kernel-2.4.9-20.
The SMP kernel (8 CPUs) couldn't pass our silly test above, but the UP kernel could,
with both aic7xxx.o and aic7xxx_mod.o.

Any comments or requirements for our test?


Comment 8 Shinya Narahara 2002-01-27 04:48:18 UTC
Created attachment 43656 [details]
kernel log of kernel-2.4.3-12smp + aic7xxx_mod.o with our heatrun test, again.

Comment 9 Shinya Narahara 2002-01-27 04:51:36 UTC
Based on our heatrun test and the oops message in attachment 43656, we attempted to comment out the BUG()
macro in slab.c, and it seemed to be better, passing our test for 3 days.

May we comment it out? Are there any side effects of doing so?


Comment 10 Arjan van de Ven 2002-01-27 10:10:58 UTC
That BUG() triggers if the kernel fails to notify all CPUs of something.
This is either a bug in that code, or a motherboard/chipset bug. Interesting....
I've never seen this one before.

Comment 11 Shinya Narahara 2002-03-15 00:40:35 UTC
We believe we've found a solution for this issue.
This must be a kernel memory problem: a double kfree(). After patching
fs/partitions/efi.c, we don't have this issue anymore.

This patch has a large effect because the bug is a fundamental one. We
strongly recommend applying this patch to your kernel...

--- linux-2.4.9-21/fs/partitions/efi.c.orig	Fri Mar  1 16:59:19 2002
+++ linux-2.4.9-21/fs/partitions/efi.c	Mon Mar 11 16:34:29 2002
@@ -546,8 +547,8 @@
 
 		*gpt = pgpt;
 		*ptes = pptes;
-		if (agpt)  kfree(agpt);
-		if (aptes) kfree(aptes);
+		if (agpt)   { kfree(agpt); agpt=NULL; }
+		if (aptes)  { kfree(aptes); aptes=NULL; }
 	} /* if primary is valid */
 	else {
 		/* Primary GPT is bad, check the Alternate GPT */
@@ -595,6 +596,8 @@
         if (agpt) {kfree(agpt); agpt = NULL;}
         if (pptes) {kfree(pptes); pptes = NULL;}
         if (aptes) {kfree(aptes); aptes = NULL;}
+        *gpt = NULL;
+        *ptes = NULL;
 	return 0;
 }
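
The idea behind the patch can be illustrated with a small user-space sketch. This is not the efi.c code: malloc()/free() stand in for kmalloc()/kfree(), and the names struct table, free_and_clear() and find_valid_table() are hypothetical, chosen only for this example. It shows the two habits the patch applies: clear a local pointer right after freeing it, and set the caller's output pointer to NULL on the failure path so the caller cannot free the same buffer again.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a GPT header or partition-entry buffer. */
struct table {
    int valid;
};

/* Free a buffer and clear the owner's pointer, so that a later
 * "if (p) free(p)" style cleanup cannot free it a second time. */
void free_and_clear(struct table **p)
{
    if (*p) {
        free(*p);
        *p = NULL;
    }
}

/*
 * Sketch of a "find a valid table" routine.
 * Success: ownership of the primary buffer moves to *out, returns 1.
 * Failure: every temporary is freed, *out is set to NULL, returns 0,
 * so the caller never holds a pointer to freed memory.
 */
int find_valid_table(int primary_ok, struct table **out)
{
    struct table *primary = calloc(1, sizeof(*primary));
    struct table *alternate = calloc(1, sizeof(*alternate));

    assert(out != NULL);

    if (primary && alternate && primary_ok) {
        primary->valid = 1;
        *out = primary;              /* caller owns this buffer now */
        free_and_clear(&alternate);  /* freed once, then cleared */
        return 1;
    }

    /* Shared failure path: safe even if a pointer is already NULL. */
    free_and_clear(&primary);
    free_and_clear(&alternate);
    *out = NULL;                     /* never leave a stale pointer behind */
    return 0;
}
```

With this shape, a caller that does "if (t) free(t);" after a failed find_valid_table(0, &t) call is harmless, because t is NULL at that point rather than a pointer to memory that was already released.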


Comment 12 Arjan van de Ven 2002-04-02 13:48:21 UTC
scheduled for the next erratum; it indeed looks like a serious bug