Bug 18334

Summary: Heavy I/O load causes deadlock
Product: [Retired] Red Hat Linux
Reporter: Matt Domsch <matt_domsch>
Component: kernel
Assignee: Michael K. Johnson <johnsonm>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 7.3
CC: notting
Hardware: ia64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2000-10-18 18:39:16 UTC
Attachments:
cpcmp.tgz (flags: none)

Description Matt Domsch 2000-10-04 15:05:02 UTC
Using the 2.4.0-0.31 IA-64 kernel (and earlier), the system deadlocks under 
heavy disk I/O load.  Magic SysRq shows it waiting on one or more 
spinlocks.  The spinlocks I've observed it waiting on include:

file_move() getting file_list_lock.
sync_old_buffers() getting the big kernel lock.
schedule() getting the runqueue lock.
force_sig_info()
non_syscall()
restore_all()

Issue seen on SMP systems (2 or 4 B0-step processors) with either 1GB or 
5GB of RAM.

I'll attach a copy-and-compare tool that triggers the behavior within 
several seconds.

cpcmp.tgz attached.  Untar, cd cpcmp.
The command looks like:
./cpcmp.pl 6 /usr/src/linux-2.4 /mnt/disk2/a,/mnt/disk1/b,50000,20

6 is the syslog level at which to write messages.
/usr/src/linux-2.4 is your original set of information.
This first gets copied to /mnt/disk2/a{1..20}/.
Then data gets copied from /mnt/disk2/a{1..20}/ to /mnt/disk1/b{1..20}/.
50000 is the number of times each thread runs (enough to last a long, long time).
20 is the number of threads to run.
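
To make the layout of that last argument concrete, here is a small Python sketch of how the comma-separated string maps onto the directories and counts described above.  It is not taken from cpcmp.pl; the names CpcmpArgs, parse_spec and per_thread_dirs are made up for illustration, and the only assumption is the field order first-base,second-base,iterations,threads given in this report.

```python
from typing import List, NamedTuple


class CpcmpArgs(NamedTuple):
    """Hypothetical container for the four comma-separated fields."""
    first_base: str   # e.g. /mnt/disk2/a
    second_base: str  # e.g. /mnt/disk1/b
    iterations: int   # how many times each thread runs
    threads: int      # how many threads to run


def parse_spec(spec: str) -> CpcmpArgs:
    """Split 'first,second,iterations,threads' into typed fields."""
    first, second, iterations, threads = spec.split(",")
    return CpcmpArgs(first, second, int(iterations), int(threads))


def per_thread_dirs(base: str, threads: int) -> List[str]:
    """Expand a base path into one directory per thread: base1/ .. baseN/."""
    return [f"{base}{i}/" for i in range(1, threads + 1)]


args = parse_spec("/mnt/disk2/a,/mnt/disk1/b,50000,20")
first_dirs = per_thread_dirs(args.first_base, args.threads)
second_dirs = per_thread_dirs(args.second_base, args.threads)
print(first_dirs[0], second_dirs[-1], args.iterations, args.threads)
```

With the example command this yields 20 directories on each disk, a1 through a20 and b1 through b20, one pair per thread.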

Of course, change parameters at will.  This test with 2 disks for me died 
in about 2.5 seconds waiting on a spinlock.
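
For readers without the attachment, the general shape of a threaded copy-and-compare load can be sketched in Python as below.  This is only an illustration of the technique, not the contents of cpcmp.pl: the function names are hypothetical, the tree is a tiny temporary one, and the thread and iteration counts are kept small so it finishes quickly.

```python
import filecmp
import shutil
import tempfile
import threading
from pathlib import Path


def trees_match(left: Path, right: Path) -> bool:
    """Recursively compare two directory trees by file content."""
    cmp = filecmp.dircmp(left, right)
    if cmp.left_only or cmp.right_only or cmp.common_funny:
        return False
    # dircmp itself compares shallowly, so re-check common files byte for byte.
    _, mismatch, errors = filecmp.cmpfiles(
        left, right, cmp.common_files, shallow=False
    )
    if mismatch or errors:
        return False
    return all(trees_match(left / d, right / d) for d in cmp.common_dirs)


def worker(src: Path, dst: Path, iterations: int, failures: list) -> None:
    """Repeatedly copy src to dst and verify the copy."""
    for _ in range(iterations):
        if dst.exists():
            shutil.rmtree(dst)
        shutil.copytree(src, dst)
        if not trees_match(src, dst):
            failures.append(dst)
            return


def run(original: Path, work: Path, threads: int, iterations: int) -> list:
    """Seed one private copy per thread, then run all workers in parallel."""
    failures: list = []
    pool = []
    for i in range(1, threads + 1):
        src = work / f"a{i}"
        shutil.copytree(original, src)
        pool.append(threading.Thread(
            target=worker, args=(src, work / f"b{i}", iterations, failures)
        ))
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return failures


# Small self-contained demo on a temporary tree (sizes are arbitrary).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "orig"
    (original / "sub").mkdir(parents=True)
    (original / "sub" / "file.txt").write_text("payload\n")
    failures = run(original, root / "work", threads=4, iterations=3)
print(failures)
```

An empty failure list means every copy compared equal.  On the affected kernel the symptom was a hang rather than a miscompare, so a harness like this would simply stop making progress.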

Comment 1 Matt Domsch 2000-10-04 15:06:19 UTC
Created attachment 3740 [details]
cpcmp.tgz

Comment 2 Matt Domsch 2000-10-04 15:09:57 UTC
This was reproduced on Dell "Bordeaux" systems (Intel Lion beta units), Intel 
BIOS 56.

Comment 3 Matt Domsch 2000-10-04 18:26:45 UTC
Uniprocessor kernel on the same system does not fail.  Same test has been 
running for several hours now, no problems.  My guess is that either a) the 
processor B0 stepping isn't guaranteeing atomicity wrt spinlock operations, or 
b) the ia64 spinlock operations are wrong, or c) the compiler is generating bad 
assembly for the SMP spinlock functions.


Comment 4 Matt Domsch 2000-10-04 20:13:30 UTC
The 2.4.0-0.31 kernel SRPM doesn't work on IA-32 platforms because the SCSI 
layer gets initialized twice, which was fixed in the 2.4.0-test9 series.  
Running cpcmp.pl on my IA-32 SMP system (Dell PowerEdge 2400) with a 
2.4.0-test9-final kernel built from the i386 .config file in the 2.4.0-0.31 
SRPM does not lock up (after 20 minutes, and I'll let it run).


Comment 5 Matt Domsch 2000-10-18 18:39:13 UTC
Kernel 2.4.0-test9 + the IA-64 -test9 patch + modutils 2.3.18 solves the 
problem.  Tests ran for more than 16 hours with no failures.