74554 – Periodical hang of system on harddisk I/O

Summary: Periodical hang of system on harddisk I/O

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	athlon
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-09-26 16:24 UTC by Mathias Retzlaff
Modified:	2008-08-01 16:22 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:39:57 UTC
Embargoed:

Description Mathias Retzlaff 2002-09-26 16:24:40 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt; 
UiuqmHmqouVilORJ)

Description of problem:
Periodical hang of system (~10 sec)
-----------------------------------

OS: Redhat Linux 7.3  Kernel 2.4.18-3custom #5 SMP i686 
CPU: Dual Athlon 2000+
RAM: 2 GB registered ECC
HardDisks: 2x Maxtor 80GB
              2x 15GB software RAID partitions (mirrored) are in use

Description:
The system behaves normally until I start a Java-based chat server that 
puts a heavy load on the hard disks.

From that point on, the system stalls roughly every 350 sec (the interval 
varies widely) and hangs for ~10-15 sec. It then continues without any 
problems until the next stall, roughly 350 sec later.

Testing showed that when the hang occurs, all threads in the system keep 
running until they do I/O to the hard disk. At that point they block for more 
than 10 seconds and then carry on with their work. Threads that do not write 
to the hard disk do not stop, even if they write to the network card or the 
console.

My first idea was that ext3 journaling caused this, so I changed the "ext3" 
entries in fstab to "ext2" and rebooted, but the error still occurs.

The file system is ext3 on a software RAID (RAID level 1).

Here is the output of a Perl script I wrote. It sleeps for 1 second and then 
checks how long it really slept. If the script writes only to the console and 
I watch the output in my ssh terminal, it does not see any abnormally long 
sleeps. If it writes to the hard disk (by redirecting the output to a file), 
it does see abnormally long sleeps:

[...]
Sleep for: 11 sec   Time since lastoccurence: 408 sec
Sleep for:  6 sec   Time since lastoccurence: 297 sec
Sleep for: 13 sec   Time since lastoccurence: 325 sec
Sleep for: 12 sec   Time since lastoccurence: 275 sec
Sleep for: 11 sec   Time since lastoccurence: 408 sec
Sleep for:  6 sec   Time since lastoccurence: 297 sec
Sleep for: 13 sec   Time since lastoccurence: 325 sec
Sleep for: 12 sec   Time since lastoccurence: 275 sec
Sleep for: 13 sec   Time since lastoccurence: 499 sec
Sleep for: 15 sec   Time since lastoccurence: 342 sec
Sleep for:  9 sec   Time since lastoccurence: 260 sec
Sleep for: 14 sec   Time since lastoccurence: 728 sec
[...]
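
The Perl script itself is not included in the report. A minimal sketch of the
same idea, written in Python purely for illustration: write a line to a file
once per second and report any loop iteration that took much longer than
expected. The function names, the heartbeat line, and the 2-second threshold
are assumptions, not details of the original script.

```python
import time


def find_stalls(timestamps, threshold=2.0):
    """Return (stall_length, time_since_previous_stall) pairs.

    `timestamps` are wall-clock readings taken once per loop iteration.
    A gap larger than `threshold` seconds between two consecutive
    readings counts as a stall. The threshold is an illustrative choice.
    """
    stalls = []
    if not timestamps:
        return stalls
    last_stall_end = timestamps[0]
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > threshold:
            stalls.append((gap, prev - last_stall_end))
            last_stall_end = cur
    return stalls


def monitor(path, interval=1.0, threshold=2.0):
    """Loop forever: sleep, write a heartbeat to `path`, report late iterations.

    The write and flush sit inside the timed region, so a write that the
    kernel blocks shows up as an overlong iteration. flush() only hands
    the data to the kernel; it does not force it onto the disk.
    """
    prev = last_stall_end = time.time()
    with open(path, "a") as out:
        while True:
            time.sleep(interval)
            out.write("tick\n")
            out.flush()
            now = time.time()
            if now - prev > threshold:
                out.write("Sleep for: %2d sec   Time since last occurrence: %d sec\n"
                          % (now - prev, prev - last_stall_end))
                last_stall_end = now
            prev = now
```

Calling `monitor()` with a path on the affected file system runs until
interrupted; `find_stalls()` applies the same rule to a recorded list of
timestamps.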

At exactly the same moments the chat server (and, as far as I can see, every 
other thread) also waits for the same amount of time.

I would be grateful if someone could give me a solution to the problem 
or a hint about what to look for.
If you require more information about the system, please mail me at
hangbug

Thanks in advance

Mathias Retzlaff


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.Don't know how to reproduce it on another system.
2.
3.
	

Actual Results:  the system stops periodically

Expected Results:  the system should run without hanging

Additional info:

Comment 1 Arjan van de Ven 2002-09-26 16:27:18 UTC

can you check if IDE DMA is turned on (hdparm -i /dev/hda) ?

Comment 2 Mathias Retzlaff 2002-09-26 16:33:34 UTC

[root]# hdparm -i /dev/hda

/dev/hda:

 Model=MAXTOR 6L080L4, FwRev=A93.0500, SerialNo=664219358072
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
 BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=156355584
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4 
 DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 udma6 
 AdvancedPM=no WriteCache=enabled
 Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-1 ATA-2 ATA-3 ATA-4 ATA-5

Comment 3 Stephen Tweedie 2002-09-26 16:52:51 UTC

hdparm -i only shows what we negotiated at drive discovery time.  If we've dropped
to pio due to IO errors, you'll need normal "hdparm /dev/hd*" output to show that.

Are there any kernel messages showing up in /var/log/messages which might
indicate problems talking to this disk?

Also, do you see the same problem if you use the standard Red Hat kernels?
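
A small sketch of how that check could be automated, assuming the plain
"hdparm <device>" output contains a "using_dma" line followed by a 0 or 1. The
helper names are hypothetical, and running hdparm normally requires root.

```python
import re
import subprocess


def dma_enabled(hdparm_output):
    """Return True/False from the using_dma line, or None if it is absent."""
    match = re.search(r"^\s*using_dma\s*=\s*(\d+)", hdparm_output, re.MULTILINE)
    if match is None:
        return None
    return match.group(1) != "0"


def check_device(device):
    """Run `hdparm <device>` and report the current DMA state of the drive."""
    result = subprocess.run(["hdparm", device], capture_output=True,
                            text=True, check=True)
    return dma_enabled(result.stdout)
```

For example, `check_device("/dev/hda")` returns False if the driver has fallen
back to PIO since boot, which "hdparm -i" alone would not reveal.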

Comment 4 Mathias Retzlaff 2002-09-26 18:17:13 UTC

[root]# hdparm /dev/hda

/dev/hda:
 multcount    = 16 (on)
 I/O support  =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 9732/255/63, sectors = 156355584, start = 0
 busstate     =  1 (on)

[root]# hdparm /dev/hdc

/dev/hdc:
 multcount    = 16 (on)
 I/O support  =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 9732/255/63, sectors = 156355584, start = 0
 busstate     =  1 (on)

--------------------------------------------------------------------------
/var/log/messages:

[...]
Sep 22 00:22:56 kernel: Uniform Multi-Platform E-IDE driver Revision: 6.31
Sep 22 00:22:56 kernel: ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Sep 22 00:22:56 kernel: AMD7441: IDE controller on PCI bus 00 dev 39
Sep 22 00:22:56 kernel: AMD7441: chipset revision 4
Sep 22 00:22:56 kernel: AMD7441: not 100%% native mode: will probe irqs later
Sep 22 00:22:56 kernel: AMD7441: disabling single-word DMA support (revision < C4)
Sep 22 00:22:56 kernel:     ide0: BM-DMA at 0xb800-0xb807, BIOS settings: hda:DMA, hdb:DMA
Sep 22 00:22:56 kernel:     ide1: BM-DMA at 0xb808-0xb80f, BIOS settings: hdc:DMA, hdd:pio
Sep 22 00:22:56 kernel: hda: MAXTOR 6L080L4, ATA DISK drive
Sep 22 00:22:56 kernel: hdb: FX54++W, ATAPI CD/DVD-ROM drive
Sep 22 00:22:56 kernel: hdc: MAXTOR 6L080L4, ATA DISK drive
Sep 22 00:22:56 kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Sep 22 00:22:56 kernel: ide1 at 0x170-0x177,0x376 on irq 15
Sep 22 00:22:56 kernel: blk: queue c03cd2c4, I/O limit 4095Mb (mask 0xffffffff)
Sep 22 00:22:56 kernel: hda: 156355584 sectors (80054 MB) w/1819KiB Cache, CHS=9732/255/63, UDMA(100)
Sep 22 00:22:56 kernel: blk: queue c03cd628, I/O limit 4095Mb (mask 0xffffffff)
Sep 22 00:22:56 kernel: hdc: 156355584 sectors (80054 MB) w/1819KiB Cache, CHS=155114/16/63, UDMA(100)
[...]

--------------------------------------------------------------------------
I have not tried the standard kernel yet.

The only change I made to the kernel was to raise the limit on open files per
process from 1024 to 16384.

Comment 5 Arjan van de Ven 2002-09-26 18:18:51 UTC

ehm that limit is a runtime tunable.. no need to recompile for that ;(

Comment 6 Mathias Retzlaff 2002-09-27 14:12:57 UTC

I can't find a runtime tunable for the hard limit on the maximum number of 
open files per process.

The current value I compiled into the kernel is 16384.
When I search for "16384" in 

[root]# sysctl -A 

I find it three times:
1.) net.ipv4.tcp_wmem = 4096        16384   131072
2.) net.ipv4.route.gc_thresh = 16384
3.) kernel.msgmnb = 16384

and none of those is the value I'm looking for.

The things I changed are:
-------------------------
include/linux/fs.h:
old:
#define INR_OPEN 1024
...
#define NR_FILE  8192

new:
#define INR_OPEN 16384
...
#define NR_FILE  32768
-------------------------
include/linux/limits.h:
old:
#define NR_OPEN 1024

new:
#define NR_OPEN 16384
-------------------------

If I am wrong, please tell me so.
I'm going to test the standard kernel to see whether the hang disappears
(I will post here tomorrow morning, CET).

Thanks in advance for your help.

Mathias Retzlaff

Comment 7 Stephen Tweedie 2002-09-27 14:30:55 UTC

Umm, changing NR_OPEN will break things in non-obvious ways, especially if you
have any old binaries lying around.

The correct way to do this is with the setrlimit syscall, or the corresponding
shell command ("ulimit" in bash.)  Unprivileged users cannot raise the soft
limit above the hard limit, so if the hard limit is set to 1024, that's a fixed
ceiling unless root changes it.  Root can change the limits arbitrarily, though.

/etc/security/limits.conf will let you change the default limits for users.  If
you want particular users to be able to use more than 1024 fds, I'd recommend
increasing the hard limit but leaving the soft limit at 1024.  That way, users
will still get a default 1024 fds, but they will be able to raise that
themselves if they want more.  That will allow an application to use more fds if
it really needs to, without risking breaking old apps which cannot cope with so
many fds.
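
As a rough sketch of the setrlimit route from inside an application, using
Python's standard "resource" module as a wrapper around the syscall. The
function names are illustrative, and the value 16384 is simply the figure used
earlier in this report.

```python
import resource


def clamp_soft_limit(wanted, hard):
    """Largest soft limit not above `wanted` that the hard limit allows."""
    if hard == resource.RLIM_INFINITY:
        return wanted
    return min(wanted, hard)


def raise_nofile_soft_limit(wanted):
    """Raise this process's soft RLIMIT_NOFILE towards `wanted`.

    Never lowers the current soft limit and never touches the hard limit.
    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = clamp_soft_limit(wanted, hard)
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        soft = new_soft
    return soft
```

An application that needs many descriptors would call
`raise_nofile_soft_limit(16384)` at startup. The matching
/etc/security/limits.conf entries would take the form "someuser soft nofile
1024" and "someuser hard nofile 16384", where "someuser" is a placeholder.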

Comment 8 Mathias Retzlaff 2002-09-28 09:31:48 UTC

This morning I tested both the standard SMP kernel and the standard 
uniprocessor kernel, but the system keeps hanging periodically.

The system is currently running the standard SMP kernel
(Linux  2.4.18-3smp #1 SMP Thu Apr 18 06:59:55 EDT 2002 i686 unknown)
and I used /etc/security/limits.conf to raise the maximum number of open files.

If you need any further information, just ask for it.

I wonder why kjournald is still running and consuming CPU time, although I 
switched every partition's file system to ext2 ...
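
One thing worth ruling out (an assumption, not something established in this
report) is that a file system is still mounted as ext3 despite the fstab
change. /proc/mounts lists the types the kernel actually mounted, so a short
sketch like the following could check it. The helper names are hypothetical.

```python
def parse_mounts(mounts_text):
    """Parse /proc/mounts-style text into (device, mountpoint, fstype) tuples."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[0], fields[1], fields[2]))
    return entries


def mounts_of_type(mounts_text, fstype):
    """Return the mount points whose file system type equals `fstype`."""
    return [mountpoint
            for _device, mountpoint, fs in parse_mounts(mounts_text)
            if fs == fstype]


def read_proc_mounts(path="/proc/mounts"):
    """Read the kernel's view of the currently mounted file systems."""
    with open(path) as handle:
        return handle.read()
```

`mounts_of_type(read_proc_mounts(), "ext3")` returns an empty list only if no
file system is currently mounted with a journal.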

Comment 9 Bugzilla owner 2004-09-30 15:39:57 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
