Bug 77564

Summary: UDMA causes serious filesystem/hdd lag
Product: [Retired] Red Hat Linux
Reporter: Chris Pauly <quiff>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED WORKSFORME
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 8.0
Hardware: athlon
OS: Linux
Doc Type: Bug Fix
Last Closed: 2002-11-13 12:08:00 UTC

Description Chris Pauly 2002-11-09 09:03:33 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)

Description of problem:
Using any UDMA mode causes the HDD/filesystem to lock up for seconds at a
time whenever there is heavy disk activity.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. TEST 1: Copy a large file (300+MB) from the network to RedHat's HDD using
FTP or Samba, and watch the bandwidth monitor on the other computer. This uses
a default RedHat install with no HDD or FS tweaks whatsoever.
2. TEST 2: Copy a large file from the network to /dev/null on the RedHat
machine using FTP, and watch the bandwidth monitor.
3. TEST 3: Turn "unmaskirq" off (unsure whether this made a difference), and
switch to "mdma2" mode, using hdparm. Repeat the same 300+MB network file
transfer test over Samba or FTP, writing to the HDD.
4. TEST 4: Try to get drive stats using hdparm immediately after cancelling the
stalled transfer from Test 1.
5. TEST 5: Switch to any PIO mode using hdparm and repeat the 300+MB network
write test to the HDD.
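Since the network was initially suspected, a purely local version of Test 1 would isolate the disk from the NIC. The following is a minimal Python sketch, not part of the original report: it writes a file in fixed-size chunks and records throughput per interval, so a stall shows up as a sample far below the others. The file path, total size, chunk size and interval are hypothetical parameters chosen for illustration; buffered writes are timed as the application sees them, with a final fsync so the data actually reaches the disk.

```python
import os
import time

MIB = 1024 * 1024


def measure_write_throughput(path, total_mb=300, chunk_kb=64, interval_s=1.0):
    """Write total_mb MiB of zeros to path; return per-interval rates in MiB/s.

    All parameters are illustrative assumptions, not values from the report.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    chunk = b"\0" * (chunk_kb * 1024)
    total_bytes = total_mb * MIB
    samples = []
    written = 0
    window_bytes = 0
    window_start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(len(chunk), total_bytes - written)
            f.write(chunk[:n])  # buffered write; may block while writeback stalls
            written += n
            window_bytes += n
            now = time.monotonic()
            if now - window_start >= interval_s:
                # A long blocking write stretches this window and lowers its rate.
                samples.append(window_bytes / (now - window_start) / MIB)
                window_bytes = 0
                window_start = now
        f.flush()
        os.fsync(f.fileno())  # force data to disk; timed in the last sample if one is pending
    now = time.monotonic()
    if window_bytes and now > window_start:
        samples.append(window_bytes / (now - window_start) / MIB)
    return samples
```

For example, calling `measure_write_throughput("/mnt/test/big.bin")` with a hypothetical path on the ext3 partition under test returns roughly one rate per second of writing, which can be compared against the bandwidth-monitor readings from the network tests.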


Actual Results:  TEST 1: The transfer starts well at about 9MB/s, then
suddenly dips to ZERO throughput after about 5 seconds or less (could this be
when writes are committed from the cache/journal? That happens every 5
seconds, right?). It stays at zero throughput for a few seconds, then runs at
~9MB/s again for a few seconds, then drops to zero again, and so on until the
file is done. Sometimes, after a while, it hovers between ~0.5MB/s and
~4MB/s, going up and down continuously.

TEST 2: Transfer is a solid ~9MB/s all the way through to the end of the file.

TEST 3: Transfer is a solid ~5MB/s all the way through.

TEST 4: hdparm hangs for up to a few minutes before finally returning to the
prompt. Ctrl-C does not work.

TEST 5: The transfer again runs at about 9MB/s for up to 5 seconds, then
suddenly dips to ~2MB/s. After a few seconds it goes back up to 9MB/s, then
back down to 2MB/s, and so on until the file is done. It never dips to zero
throughput, and its 2MB/s<->9MB/s switches are consistent.
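To check whether the dips recur at a regular spacing (for example, near ext3's default 5-second journal commit interval), per-interval throughput samples like those read off a bandwidth monitor can be scanned for runs that fall well below the peak. This is a hypothetical helper, not a tool used in the report; the 25%-of-peak threshold is an assumption. The spacing between dip starts covers both the fast phase and the stall, so it indicates the length of one cycle rather than directly measuring the commit interval.

```python
def find_dips(samples, threshold_frac=0.25):
    """Return the start index of each run of samples below threshold_frac * peak."""
    if not samples:
        return []
    limit = max(samples) * threshold_frac
    starts = []
    in_dip = False
    for i, rate in enumerate(samples):
        if rate < limit:
            if not in_dip:
                starts.append(i)  # first sample of a new dip
            in_dip = True
        else:
            in_dip = False
    return starts


def dip_intervals(samples, interval_s=1.0, threshold_frac=0.25):
    """Return the time in seconds between consecutive dip starts."""
    starts = find_dips(samples, threshold_frac)
    return [(later - earlier) * interval_s for earlier, later in zip(starts, starts[1:])]
```

With one sample per second, a Test 1 style trace such as `[9, 9, 9, 9, 0, 0, 9, 9, 9, 9, 0, 0, 9, 9, 9]` has dips starting at indices 4 and 10, so `dip_intervals` returns `[6.0]`.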


Expected Results:  TEST 1: A solid 9MB/s, just like RedHat 7.3 used to give
me, and just like RedHat 7.3 is giving me now after I reinstalled it over
the top of the *extremely buggy* RedHat 8.0.

TEST 2: As expected.

TEST 3: As expected. Multi-word DMA isn't as fast as Ultra DMA, of course, but
this suggests that there is a problem with the UDMA code in RedHat 8.0, but
not with the PIO or MDMA code.

TEST 4: hdparm should respond immediately and return the command prompt, like 
it does normally when there is low disk activity.

TEST 5: I guess I expect the actual results. I'm not too sure how
journalling works, or how RedHat's write caching works, but it appears that
PIO mode has no problem: 9MB/s into the cache, then while it's
committing/flushing the cache it throttles to about 2MB/s, then fills up the
cache again, and repeats.


Additional info:

RedHat 7.3 works great. RedHat 8.0 doesn't. Yet I can't see much difference
between the kernel versions.

I am using an Asus A7V133-C motherboard (VIA KT133A chipset), an Ultra ATA/66
Quantum Fireball Plus KA 9.1GB (primary master, UDMA66+ cable), an Ultra
ATA/100 Maxtor DiamondMax Plus D740X 80GB (secondary master, UDMA66+ cable),
an SMC 1211TX Realtek NIC and a TNT2. All filesystems written to during the
tests were ext3.

Note that I didn't do much testing on the Maxtor HDD. The tests I did with
multi-word DMA worked fine on the Quantum HDD, but NOT on the Maxtor. The
Maxtor had problems no matter what I did.

I also didn't do any local tests, as for most of my testing and frustration
I was under the impression that the NIC was at fault. I now have RedHat 7.3
installed, so it's too late to test.

These tests were done on a standard RedHat 8.0 install with very little 
configuration, and up2date'd (including kernel) as of 8th November 2002.

Also note, I am not sure whether this is a *kernel* bug or a bug in the files
that interact with the kernel. Seeing as the kernel did not change much
between 7.3 and 8.0, I would guess that it might not be a kernel bug, but
where the heck this bug report belongs, I don't know.

Again, RedHat 7.3, reinstalled this morning and up2date'd (including the
kernel) as of 9th November 2002, works great, just like RedHat 7.3 has always
worked for me.

Comment 1 Arjan van de Ven 2002-11-09 09:30:55 UTC
This is funny, since the 7.3 update kernel is practically identical to the 8.0
update kernel ;(

Comment 2 Chris Pauly 2002-11-12 10:21:52 UTC
Yeah, I thought they looked pretty much identical, which is why I'm puzzled as
to whether this is even a kernel bug. :(

Well... I was busy until today, so I haven't been able to test more until
now. Now I find that I have the same problem again in RedHat 7.3, whether
up2date'd or with the original RPMs (not a fresh install, just rpm -U
--oldpackage back to the original RPMs).

It *was* working fine before I did more post-install configuration, which
leads me to believe I've done something to the BIOS, or RedHat has done
something funky with my configuration after I installed more stuff.

I have encountered a problem in the past with installed programs (rp-pppoe, in
fact) stomping over the kernel, causing performance issues and forcing me to
reboot (even killing the program didn't help), so I'm not going to ignore the
possibility that something has fiddled with settings or is running at startup
and causing problems with the kernel.

Not sure if it means much, but some relevant dmesg output is:

Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller on PCI bus 00 dev 21
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:04.1
    ide0: BM-DMA at 0xd800-0xd807, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0xd808-0xd80f, BIOS settings: hdc:DMA, hdd:pio
hda: QUANTUM FIREBALLP KA9.1, ATA DISK drive
hdc: MAXTOR 6L080J4, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: 18041184 sectors (9237 MB) w/371KiB Cache, CHS=1123/255/63, (U)DMA
hdc: 156355584 sectors (80054 MB) w/1819KiB Cache, CHS=155114/16/63, UDMA(100)
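The capacities in these dmesg lines can be cross-checked: they are consistent with 512-byte sectors counted in decimal megabytes, since 18041184 × 512 / 10^6 ≈ 9237 and 156355584 × 512 / 10^6 ≈ 80054, both matching the printed values. The hypothetical parser below, not part of the original report, pulls out each drive's sector count, reported size and the transfer mode printed at the end of the line ("(U)DMA" for hda, "UDMA(100)" for hdc). Its regular expression assumes the exact line format shown above; other kernel versions may print these lines differently.

```python
import re

SECTOR_BYTES = 512  # IDE sector size

# Assumed pattern for lines such as:
# "hdc: 156355584 sectors (80054 MB) w/1819KiB Cache, CHS=155114/16/63, UDMA(100)"
DISK_LINE = re.compile(r"^(hd[a-z]): (\d+) sectors \((\d+) MB\).*, (\S+)$")


def parse_ide_disks(dmesg_text):
    """Map each IDE disk name to its sector count, sizes in MB and transfer mode."""
    disks = {}
    for line in dmesg_text.splitlines():
        match = DISK_LINE.match(line.strip())
        if match is None:
            continue  # not a capacity line (e.g. the model-name line)
        name, sectors, reported_mb, mode = match.groups()
        sectors = int(sectors)
        disks[name] = {
            "sectors": sectors,
            "reported_mb": int(reported_mb),
            # Decimal megabytes, truncated, which matches the printed figures
            "computed_mb": sectors * SECTOR_BYTES // 10**6,
            "mode": mode,
        }
    return disks
```

Feeding it the dmesg excerpt above yields entries for `hda` and `hdc` whose `computed_mb` equals `reported_mb`, so both drives are detected at full capacity.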

Does "VP_IDE: not 100% native mode: will probe irqs later" mean anything bad?

I'll continue testing. Please let me know if there's a specific way I can test
or any useful settings I can change.

Is it worth installing some older or newer kernels (rawhide maybe? or build a 
new kernel directly from kernel.org?) and see if they work?



Comment 3 Chris Pauly 2002-11-13 12:07:53 UTC
OK, never mind, we might as well close this bug. I've narrowed it down even
further, and it turns out to be yet *another* bad Maxtor product. Probably
not a faulty drive, just Maxtor's terrible products. I've had nothing but
problems with them in the past. This drive didn't work properly in Windows to
start with, so I moved it to Linux for storage in the hope that it'd work well.

I tested the Maxtor and Quantum drives on a KT333 motherboard and they behaved
the same.

The Maxtor's problems are compounded by ext3's journaling. When I mount the
partitions on both drives as ext2, the Quantum runs at a solid 9.5-11.5MB/s in
both directions (read/write) in multiple tests with ~700MB files: absolutely
no problem at all, even in udma66. With ext3, though, it dips a little every
~5 seconds when the journal/buffer is flushed (I guess? Honestly, I don't know
much about journaling) and gives somewhat slower performance (but acceptable
and bug-free).

However, with ext2 the Maxtor starts to have problems after about 10 seconds
of writing at ~9MB/s. It falls to about 0.5-2MB/s, hovering up and down
between the two. If I abort the transfer and quickly delete the file, the
delete takes up to a minute to complete, whereas on the Quantum drive deletion
is instant. With ext3, the 9MB/s write spurts end after a few seconds and sink
to zero throughput for a few seconds, then repeat. Reading from the Maxtor in
ext2 or ext3 behaves like writing in ext3: it goes at 9MB/s for a few
seconds, then 0MB/s for a few seconds, and so on.

Sorry for the hassle.

The lesson? Don't buy anything Maxtor. :)