Bug 493513

Summary: High system CPU load under heavy disk I/O
Product: Fedora
Reporter: Larkin Lowrey <llowrey>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 10
CC: jmoyer, kernel-maint, maxim.yegorushkin
Keywords: Triaged
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-12-18 09:10:22 UTC

Description Larkin Lowrey 2009-04-02 03:28:27 UTC
Description of problem:
Simple disk activity causes high system CPU time. A simple sequential read on an idle system causes a 25% system CPU load (according to top and vmstat). Reading from, or writing to, more than 4 disks drives CPU utilization to 100% and reduces per-disk throughput.

For example, doing a sequential read from 10 disks results in reads of about 19MB/s per drive when each yields 60MB/s alone.

When I boot from a RIPLinux CD (with a 2.6.18 x86_64 kernel), the same system easily reads at full speed (60MB/s) from each drive, with a total aggregate read rate of over 600MB/s (for 10 drives). When I boot the RIPLinux CD I mount the FC10 root and chroot into it, so the only difference is, or should be, the kernel configuration.

I have another host, also running FC10, with a custom 2.6.28 kernel built from the stock FC10 kernel .config file. It shows the same high system CPU problem, so the issue is not a .27 vs .28 kernel difference.

The only conclusion I can draw is that the FC10 kernels are configured in a way that causes the problem I'm experiencing.

I have tried disabling ACPI in the BIOS, booting with acpi=off and nohz=off, and switching through all of the I/O schedulers, with no significant change.
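
For reference, a minimal sketch of switching the I/O scheduler per device through the standard sysfs interface (sda is just an example here):

# cat /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sda/queue/scheduler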


Version-Release number of selected component (if applicable):
Kernel 2.6.27.19-170.2.35.fc10.x86_64

How reproducible:
Every time, every boot.

Steps to Reproduce:
1. dd if=/dev/sda of=/dev/null
2. Watch vmstat and observe a high (~20%+) system CPU load.
3. Do the same with two dd's running simultaneously to see ~40% system load, ~60% for three, and so on (see the sketch below).
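
A rough shell sketch of step 3, assuming a few idle disks (the device names are placeholders):

for dev in sda sdb sdc; do dd if=/dev/$dev of=/dev/null & done
vmstat 1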
  
Actual results:
20% system cpu usage reading from one drive
40% system cpu usage reading from two drives
... etc.

When reading from more than 4 drives simultaneously, the read rate from each drive is reduced.

Expected results:
1) System load less than 20% per sequential read.
2) Ability to sequentially read from 10 drives simultaneously at full speed.

Additional info:
Both systems tested are nforce4 Athlon64 systems, one dual-core and one single-core, each with 2GB RAM. All disks are SATA and attached to controllers on PCI Express. The controller drivers in use are sata_nv, sata_mv, and sata_sil24.

Trying to resync a 10-drive raid5 array is extremely slow, ~10MB/s (per drive), with the FC10 kernel, but runs at the expected 60MB/s (per drive) with the RIPLinux kernel.

Comment 1 Larkin Lowrey 2009-04-03 17:13:27 UTC
Summary: disabling tickless mode increases raw disk performance by 55% with the acpi_pm clocksource and by 61% with the tsc clocksource - still about 21% slower than the RIPLinux kernel.

New datapoints:

FC10 2.6.27.19-170.2.35.fc10.x86_64 stock

acpi_pm clocksource
8 drives: max 27MB/s (per drive)
11 drives: max 19MB/s (per drive)

tsc clocksource
8 drives: max 31MB/s (per drive)
11 drives: max 21MB/s (per drive)

2.6.29, built using the FC10 .config but with CONFIG_NO_HZ and CONFIG_HIGH_RES_TIMERS not set

acpi_pm clocksource
8 drives: max 42MB/s (per drive)
11 drives: max 29MB/s (per drive)

tsc clocksource
8 drives: max 50MB/s (per drive)
11 drives: max 35MB/s (per drive)

RIPLinux 2.6.28 stock
8 drives: max 58MB/s (per drive - less than 100% CPU)
11 drives: max 44MB/s (per drive)
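
For reference, the clocksource can be checked and switched at runtime through the standard sysfs interface (an assumption on my part; the runs above do not record exactly how it was selected):

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
# echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource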

Comment 2 Chuck Ebbert 2009-04-03 21:49:59 UTC
Does adding "nohz=off highres=off" to the stock kernel's boot options make any difference? How about just "nohz=off"? You could also try "iosched=as".
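
A minimal sketch of where those options would go, assuming the GRUB legacy setup F10 ships with (the kernel file name and root= below are taken from the version and cmdline reported elsewhere in this bug): append them to the kernel line in /boot/grub/grub.conf, e.g.

kernel /vmlinuz-2.6.27.19-170.2.35.fc10.x86_64 ro root=/dev/Raid/Root nohz=off highres=off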

Comment 3 Larkin Lowrey 2009-04-04 02:09:56 UTC
FC10 2.6.27.19-170.2.35.fc10.x86_64 stock
/proc/cmdline: ro root=/dev/Raid/Root nohz=off highres=off

8 drives: max 27MB/s (per drive)
11 drives: max 20MB/s (per drive)

The kernel params had no effect. The stats were the same with "nohz=off" alone and with "iosched=as" alone. Is there any way to verify that the kernel has acted on these params?
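
One way to verify at least the I/O scheduler: the active elevator for each disk is shown in brackets in sysfs. Note also that, as far as I know, the boot parameter the kernel parses for the default scheduler is elevator= (e.g. elevator=as) rather than iosched=, so iosched=as may simply have been ignored:

# cat /sys/block/sda/queue/scheduler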

----

At 20MB/s per drive when reading from 11 drives, the total aggregate transfer is 220MB/s. This is on a 1.8GHz Athlon64. That works out to 81 clock cycles per byte transferred from disk! I would expect brute force polling to be much more efficient than this. Am I crazy?

Comment 4 Maxim Egorushkin 2009-06-22 21:55:42 UTC
I am having the same problem.

Copying to a USB-attached hard disk mounted as /dev/sdb also causes high CPU load.

My system is an Intel E8400 with an X48 chipset and Intel ICH9-family I/O.

I can provide more info.

Comment 5 Maxim Egorushkin 2009-07-04 23:57:14 UTC
I wonder if there is any new info regarding this issue. Is there perhaps any workaround?

Comment 6 Maxim Egorushkin 2009-07-05 18:20:03 UTC
High CPU is due to iowait time on my system.

It seems to be a known issue:

http://lkml.org/lkml/2009/1/16/487

Comment 7 Chuck Ebbert 2009-07-21 16:44:06 UTC
(In reply to comment #6)
> High CPU is due to iowait time on my system.
> 
> It seems to be a known issue:
> 
> http://lkml.org/lkml/2009/1/16/487  

iowait time is not counted as part of the cpu load.

Comment 8 Chuck Ebbert 2009-07-21 16:47:49 UTC
Has anyone tried changing the cpufreq governor from 'ondemand' to 'performance'?

# echo 'performance' >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
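
The governor is per-CPU, so on a multi-core machine every core needs the same setting; a quick sketch that sets it everywhere and then confirms it:

# for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $g; done
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor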

Comment 9 Maxim Egorushkin 2009-07-21 20:30:16 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > High CPU is due to iowait time on my system.
> > 
> > It seems to be a known issue:
> > 
> > http://lkml.org/lkml/2009/1/16/487  
> 
> iowait time is not counted as part of the cpu load.  

It shows high iowait time when I write to a USB drive. Is high iowait time expected in this case or not?

[root@truth ~]# time dd bs=1G count=1 if=/dev/zero of=/media/BACKUP/zero0.bin
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 37.2891 s, 28.8 MB/s

real	0m37.417s
user	0m0.000s
sys	0m2.160s

[root@truth ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0  12376 1340368   6720 182620    0    0    68    43  134   49  1  1 97  1  0	
 0  0  12376 1340304   6720 182676    0    0     0     4  678 1505  4  1 95  0  0	
 0  0  12376 1339884   6736 182904    0    0     0   384  691 1660  4  1 95  0  0	
 1  0  12376 1339932   6736 182920    0    0     0     0  467  968  3  1 96  0  0	
 1  0  12376 275576   6744 195560    0    0    21     4 1603 1238  3 31 64  1  0	
 0  8  12376  12856   4720 407980    0    0    31 31213 1770 3078  2 17 11 70  0 <--- dd started
 1  2  12392  16668   4688 405760    0   16     1 18355 1216 2682  2  2 20 75  0	
 2  4  12392  15948   4688 407232    0    0     6 30716 1389 3344  2  3  0 95  0	
 0  4  12392  15352   4692 407588    0    0     1 29468 1338 3281  1  3  0 95  0	
 1  4  12388  16368   4712 406868   28    0   285 29176 1410 3447  2  1 33 63  0	
 1  4  12388  15376   4716 407632    0    0     1 30456 1338 3288  2  2  2 94  0	
 7  0  12388  14680   4716 408368    0    0     1 27051 1359 3196  2  2 13 82  0	
 1  0  12388  16608   4724 406732    0    0     3 24582 1354 3108  1  5 47 47  0	
 2  3  12388  15800   4732 407368    0    0     7 33736 1407 3104  2  6 47 46  0	
 0  2  12388  17240   4760 406232    0    0    97 22460 1474 3149  2 10 28 60  0	
 0  4  12388  16716   4764 407012    0    0    94 29128 1342 3243  1  3 47 49  0	
 1  4  12388  17004   4764 406580    0    0     1 29132 1324 3163  2  3 48 47  0	
 1  4  12388  16036   4768 407312    0    0     1 29128 1296 3201  1  2 49 49  0	
 0  4  12388  16940   4768 406236    0    0     1 29128 1259 3217  1  2 49 48  0	
 0  3  12388  15616   4768 407692    0    0     6 29152 1457 3457  3  4 34 59  0	
 1  2  12388  16604   4772 406732    0    0     1 31048 1324 3208  1  2  3 94  0	
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  2  12388  16436   4772 406888    0    0     1 29120 1316 3230  1  2  0 97  0	
 1  2  12388  16364   4776 407036    0    0     1 29128 1344 3229  2  2  0 96  0	
 1  2  12388  16652   4780 406644    0    0     1 29128 1333 3214  1  3  0 96  0	
 0  4  12388  16996   4824 406356    0    0    31 24284 1662 3497  2 11  0 87  0	
 0  4  12388  16616   4828 406660    0    0     1 29132 1381 3331  2  2  0 96  0	
 1  1  12388  15524   4828 407888    0    0     1 26480 1275 3111  1  2  4 93  0	
 1  2  12388  16140   4836 407156    0    0     4 32884 1431 3187  1  5 48 46  0	
 0  4  12388  16004   4852 407312    0    0    20 22884 1596 3266  2 13 43 43  0	
 0  4  12388  15464   4852 407668    0    0     1 29132 1329 3295  2  2 49 47  0	
 0  2  12388  16764   4864 406688    0    0     2 30468 1294 3307  1  2 49 48  0	
 0  3  12388  16056   4868 407144    0    0     1 29132 1385 3295  2  3 44 50  0	
 2  3  12384  16340   4868 406996   28    0    29 30452 1374 3258  1  3 45 51  0	
 0  1  12384  16128   4872 407156    0    0     1 27804 1334 3326  1  2 47 49  0	
 0  2  12384  15424   4876 407496    0    0     3 31308 1351 3118  2  3 48 47  0	
 0  2  12384  15300   4864 407780    0    0     2 27804 1312 3164  1  4  0 95  0	
 1  2  12384  15740   4876 407312    0    0     1 30472 1242 3110  1  2  0 97  0	
 1  2  12384  16136   4876 406904    0    0     1 29128 1331 3239  2  2  0 96  0	
 1  2  12384  15708   4880 407392    0    0   385 29128 1306 3097  1  2  0 97  0	
 1  1  12384  16260   4884 407024    0    0     1 23832 1366 3288  2  4  7 87  0	
 0  2  12384  15668   4900 407548    0    0    15 38400 1344 3114  3  3 41 52  0	
 0  0  12384 1065896   4900 407560    0    0     0 15529 1455 3060  2  7 27 65  0	
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0  12384 1066036   4900 407560    0    0     0     0  487 1109  1  1 98  0  0 <--- dd finished	
 1  0  12384 1066036   4908 407560    0    0     0    16  454 1037  2  1 97  0  0	
 1  0  12384 1065664   4912 407940    0    0   388     0  450  906  2  0 96  1  0	
 0  0  12380 1065080   4928 408548    4    0   404   388  487 1017  1  1 94  4  0	
 0  0  12380 1065044   4928 408604    0    0     0    64  600 1236  4  1 95  0  0

Comment 10 Jeff Moyer 2009-07-27 16:24:18 UTC
> It shows high iowait time when I write to a USB drive. Is high iowait time
> expected in this case or not?

Yes, high I/O wait time is expected when performing I/O.  I/O wait is just a special type of idle, so high I/O wait time is not generally indicative of a problem.
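
For reference, iowait is accounted as its own bucket, separate from user, system, and idle time; it shows up as the fifth numeric field (user nice system idle iowait ...) on the cpu line of /proc/stat and corresponds to the wa column in the vmstat output above:

# grep '^cpu ' /proc/stat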

Comment 11 Maxim Egorushkin 2009-08-04 08:37:17 UTC
My issue appears to be a separate one. I've opened another bug report: https://bugzilla.redhat.com/show_bug.cgi?id=515191

Comment 12 Chuck Ebbert 2009-09-24 21:15:05 UTC
There is a fix in the upcoming kernel 2.6.32 that has been backported to Fedora Rawhide (F-12) and is going into F-11 as well:

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=69d25870f20c4b2563304f2b79c

It should be in the 2.6.30.8 F-11 kernels.

Comment 13 Bug Zapper 2009-11-18 11:39:52 UTC
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 14 Bug Zapper 2009-12-18 09:10:22 UTC
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.