Bug 222152

Summary: stack overflow in do_IRQ under heavy disk load with HPT ide driver md lvm reiser fs
Product: Fedora
Component: kernel
Version: 6
Hardware: i386
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Reporter: Wayne H Cox <wayne.cox>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Brian Brock <bbrock>
CC: bugzilla, jonstanley, wtogami
Doc Type: Bug Fix
Last Closed: 2008-02-08 04:24:12 UTC
Bug Blocks: 427887

Description Wayne H Cox 2007-01-10 18:21:20 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1

Description of problem:
Using 4 160GB drives attached to a HighPoint RAID controller.  The controller has 4 IDE interfaces, all on the same IRQ; I used the primary channel on each.

Each disk has a single partition, and they are all joined into a raid 5 array.

The raid array was added to LVM as a single PV.

I created a 120GB logical volume and used mkreiserfs to put down the FS metadata.  After mounting the file system, I tried to copy about 16GB of data to the new file system.  The system locked up solid after about 3 to 4 GB was copied.

I found that slowing down the copy by inserting a compress/decompress step in the pipeline allowed me to get the whole 16GB copied.  I then tried copying the 16GB off.  Same result: the system locked up hard.  I was able to copy the data off using the same compress/decompress trick.

I searched the Internet and the driver source and found the ideX=serialize options.  Using ide2=serialize ide3=serialize ide4=serialize ide5=serialize helped.  Switching to the ext3 file system also helps.

I'm now using the ideX=serialize options and the ext3 file system.  The system can still be made to crash, but I have to try a lot harder.
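For reference, this is roughly how the options end up on the kernel command line (a sketch of a GRUB legacy entry; the kernel version, root device, and paths shown are illustrative, not my exact config).  As I understand it, ideX=serialize tells the IDE layer not to overlap operations on that interface with its paired interface, which is presumably why it relieves the pressure here.

    # /boot/grub/grub.conf (illustrative entry)
    title Fedora Core (2.6.18-1.2869.fc6)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-1.2869.fc6 ro root=/dev/VolGroup00/LogVol00 ide2=serialize ide3=serialize ide4=serialize ide5=serialize
        initrd /initrd-2.6.18-1.2869.fc6.img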

My suggestion is to have all devices on all IDE interfaces serviced by the hpt34x driver placed in the same 'hwgroup'.

Version-Release number of selected component (if applicable):
kernel-2.6.18-1.2869.fc6

How reproducible:
Sometimes


Steps to Reproduce:
1. Create an md RAID 5 device (I had 4 disks on a HighPoint controller).
2. Create a volume group and a logical volume.
3. Run mkreiserfs and mount the new file system.
4. Copy a huge amount of data to the new file system (a rough command sketch follows below).
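A rough command sequence for the steps above (the device names, volume names, and mount point are illustrative, not necessarily what I used):

    # 1. create the md RAID 5 array from the four single-partition disks
    mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/hde1 /dev/hdg1 /dev/hdi1 /dev/hdk1
    # 2. add the array to LVM as a single PV and carve out a logical volume
    pvcreate /dev/md0
    vgcreate vg_raid /dev/md0
    lvcreate -L 120G -n lv_data vg_raid
    # 3. create the reiserfs file system and mount it
    mkreiserfs /dev/vg_raid/lv_data
    mkdir -p /mnt/data
    mount /dev/vg_raid/lv_data /mnt/data
    # 4. copy a large amount of data (around 16GB in my case) onto the new file system
    cp -a /path/to/source/. /mnt/data/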


Actual Results:
The system locked up hard and required the hardware reset button to recover.

Expected Results:
Nothing unusual; the copy should complete without a lockup.

Additional info:
Adding ide2=serialize ide3=serialize ide4=serialize ide5=serialize helped.
Switching to ext3 also helps.  Lockups are now few and far between.

Comment 1 Chris Schanzle 2007-01-30 23:53:11 UTC
I think this is a 4k-stacks/do_IRQ stack overflow symptom.  I have a similar
setup, with 4 750GB Seagate SATA drives connected to the NV ports on an Asus A8N-SLI
Premium.  The proprietary NVIDIA driver is uninstalled.  I've been booting into runlevel 3,
logging in on the console, and turning off screen blanking (setterm -blank 0
-powersave off).  I was able to see only "do_IRQ: stack overflow: " followed by
496, 504, or other similar numbers.  No stack dump was displayed; the
system locked hard and needed a reset.  It only occurred under high I/O load, caused
locally or via NFS.

I was originally using XFS and found that the XFS maintainers admit that RAID5 + LVM + XFS
causes stack overflows (XFS uses too much stack space).

I switched to reiserfs and it was better, but it still happens occasionally, at the
worst times.

Hate to say it, but a self-compiled 2.6.20-rc6 with reiserfs has been rock solid
for a week and I've pounded the box hard on occasion (syncing homedirs with
Unison was my latest repeatable crash method).

Chris

Comment 2 Doug Dumitru 2007-02-26 06:48:04 UTC
I am seeing similar problems with 2.6.19-1.2911 running xen0.  Systems with 3
and 4 drives fail when moving large amounts of data on and off of reiserfs. 
Sometimes the system will reboot and sometimes hang.  Here is a typical message:

h-xxxx.easyco.net login: do_IRQ: stack overflow: 480
(XEN) (file=x86_emulate.c, line=1152) Cannot emulate 57
(XEN) domain_crash_sync called from entry.S (ff1611d9)
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-3.0.3-0-1.2911.fc6  x86_32p  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) EIP:    0061:[<c061b83f>]
(XEN) EFLAGS: 00010296   CONTEXT: guest
(XEN) eax: e93e8008   ebx: ed749190   ecx: 0000007b   edx: 00000000
(XEN) esi: c0684b54   edi: c061b83e   ebp: 00000011   esp: e93e8000
(XEN) cr0: 8005003b   cr4: 000006f0   cr3: 13633000   cr2: e93e7ffc
(XEN) ds: 007b   es: 007b   fs: 0000   gs: 0033   ss: 0069   cs: 0061
(XEN) Guest stack trace from esp=e93e8000:
(XEN)   Stack empty.
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
(XEN) AMD SVM Extension is disabled.

The crash messages vary.  When running Xen, this seems to be the most detail I get.
Sometimes when not running Xen, I will get full stack dumps.  Sometimes it loops
forever giving stack dumps.  Sometimes it only displays the stack overflow line.
Sometimes it displays nothing.  This is from a serial console, so perhaps it is
too deep in the IRQ handler to keep the serial output alive.

I have recompiled with the 4K stacks option unchecked, and this seems to help a lot.
Perhaps 4K stacks are a dangerous default.
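
For anyone else trying this, the option in question is CONFIG_4KSTACKS under "Kernel hacking" in the kernel config.  A sketch of the change, assuming a rebuild from the kernel source tree (the exact packaging steps will differ on Fedora):

    # in the kernel source tree:
    make menuconfig   # Kernel hacking -> uncheck "Use 4Kb for kernel stacks instead of 8Kb"
    # equivalently, in .config change "CONFIG_4KSTACKS=y" to "# CONFIG_4KSTACKS is not set"
    make
    make modules_install
    make install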

Comment 3 Jon Stanley 2008-01-08 01:47:07 UTC
(This is a mass-update to all current FC6 kernel bugs in NEW state)

Hello,

I'm reviewing this bug list as part of the kernel bug triage project, an attempt
to isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug; however, this version of Fedora is no longer
maintained.

Please attempt to reproduce this bug with a current version of Fedora (presently
Fedora 8).  If the bug no longer exists, please close it; otherwise I'll do so in a
few days if no further information is lodged.

Thanks for using Fedora!

Comment 4 Jon Stanley 2008-02-08 04:24:12 UTC
Per the previous comment in this bug, I am closing it as INSUFFICIENT_DATA,
since no information has been lodged for over 30 days.

Please re-open this bug or file a new one if you can provide the requested data,
and thanks for filing the original report!