Description of problem:
Under heavy IO load, any async IO process will spin waiting on the kvec functions in the kernel aio subsystem. IOwait climbs to 70-80% of total CPU time and aio performance is worse than sync io. During this test low memory is not constrained (the Oracle SGA is allocated from hugeTLB, so the 16G configuration isn't nearly as tight as is typical on the smp kernel).

Version-Release number of selected component (if applicable):
HP ProLiant DL740 G1
Linux RedHat_3 2.4.21-27.0.2.ELsmp
8x2998MHz (Intel(R)_Xe)
15.7Gb memory
4.9Tb filesystem (1.2Tb free)
EMC PowerPath 4.3.1-036
Qlogic 23xx 7.03 driver

How reproducible:
Any aio workload will create this, but the easiest way to reproduce is:
Oracle 9.2.0.6, async and directio enabled, 2 dbwr processes, create a 100GB tablespace

Actual results:
Time to complete is greater than with sync io. IOwait is 70-80% of CPU (basically 7 CPUs waiting on a kernel process on 1 CPU).

Expected results:
aio should be faster; iowait should be minimal.
In RHEL 3, the O_DIRECT flag is not supported with AIO. So, when doing AIO to file systems, it is not uncommon to get worse performance than SYNC I/O. One reason for this is that the AIO code path does not do readahead. It would be more useful to get information on how AIO performs on an 8-way system as compared to a 4-way. If the performance degrades when adding CPUs, then we can flag this as a scalability issue. Do you have profiling data for this? Can you point to specific code paths or locks that are under contention?
Hi, the directio/asyncio config reference was for Oracle (which puts different processes in different modes, i.e. lgwr=direct and dbwr=async, though it is unclear if dbwr is able to switch between modes). While the general case you're stating is true (async reads after writes in an O_DIRECT model will be slower because the data is not in the page cache), it should not affect this particular test case: Oracle tends not to issue reads for data it already has cached in the SGA, so the disk io request never gets initiated. There are other, more 'real world' workloads that tweak this behavior, but this seems to be the easiest to duplicate. I'll try to dig up the strace data we got, but the scalability comment came from RvR when we shared the trace with him several months ago. Iain Grey just asked me to open this as a tracking bug so they could test in Westford. tx Steve
I think you may have misinterpreted what I said above. In the event that a process doing AIO specifies the O_DIRECT flag, the O_DIRECT flag is ignored. Therefore, all reads and writes go through the page cache, incurring all of the extra overhead associated with that. In the AIO code path for file systems, we do not perform readahead, which is something that you would get without AIO. In our experience with RHEL 3, AIO on file systems is slower than SYNC I/O to file systems. I cannot speak to how the Oracle processes perform their I/O for the above configuration (i.e. whether the dbwriter or logwriter does O_DIRECT + aio). I will look into this further to see if I can verify what you said above. We currently have an 8-way system setup in our lab for testing. I would like clarification on one thing. You mention that you created a 100GB table space. Is it the operation of creating the table space that took longer than the sync io case? So to reproduce, we need only create the table space? Also, one last quick question. Does "RvR" refer to Rik van Riel? Thanks in advance.
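For reference, here is a minimal sketch (not Oracle's actual code) of the AIO + O_DIRECT combination under discussion, using the libaio interface. The file name and I/O sizes are made up for illustration, and, as noted above, on RHEL 3 the O_DIRECT flag would reportedly be ignored on the AIO path so this write would still go through the page cache. Build with: gcc -o aio_dio aio_dio.c -laio

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;
    const size_t len = 1 << 20;            /* one 1 MB write, purely illustrative */

    /* O_DIRECT requires aligned buffers; 512 bytes is the common minimum. */
    if (posix_memalign(&buf, 512, len))
        return 1;
    memset(buf, 0, len);

    /* Hypothetical test file; a database would open its datafile here. */
    int fd = open("/tmp/aio_dio_test.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (io_setup(32, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    /* Queue one async write at offset 0 and wait for its completion. */
    io_prep_pwrite(&cb, fd, buf, len, 0);
    if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { fprintf(stderr, "io_getevents failed\n"); return 1; }

    printf("async write completed, res=%ld\n", (long)ev.res);
    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}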
FYI - we too have measured Oracle TPC-C mini-benchmark performance on a 2-cpu/2-ht system when using aio+dio on rhel3 to ext3 filesystems: http://people.redhat.com/dshaks/.perf/results_oracle/tpcc_ora10g_rhel4_u2.sxc Thus we should NOT need the 8-cpu system to reproduce. Our suggestion is to use DIO but not AIO for RHEL3 Oracle EXT3 databases where the database size exceeds the size of memory. In the case above, we had a 36 GB database on our 4-cpu system with 4.0 GB of memory.
In a quick test of my 'create tablespace' suggestion, the results are:

aio+dio:  37:28.31
aio only: 22:10.02
dio only: 40:17.36
none:     25:11.49

which would seem to provide a contrary opinion.... Having said that, the block readahead features of rhel4 might turn this all on its head. RvR is indeed Rik. Creating the tablespace is sufficient to create the problem condition. There are more realistic and complex ways of creating the situation, but that's a good start. I'll try to create another case without Oracle. Also, since it is relevant:

/proc/sys/fs/aio-max-size  2147483648

tx Steve
though in this specific test, async did beat none and trounced o_dsync and thus is a slightly atypical result ;-).
I rechecked the oracle image and found we needed to link it with the aio libs. Once done, the data is roughly the same as Joe's earlier data:

1) SIO (sync) performing best, 21m8s
2) AIO (no dio) 2nd, 26m5s
3) DIO only (no aio) 3rd, 27m10s
*  AIO+DIO (effectively shuts off DIO) same as AIO

IO is still happening at 70 MB/sec, but for a 200 MB/sec fc link that can do 150-180 MB/sec with aio, this is a hard performance hit. Looks like on the 8-cpu box we are out of gas, with keventd being the only process taking care of aio requests (see the top output: keventd has cpu7 pegged). Also check out the profile data if you want to see whether we can speed up memcpy_from_kvec_dst.

Shak

[root@xeon2 aio]# more orabuild_top.txt
95 processes: 93 sleeping, 2 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq   iowait    idle
           total    2.3%    0.0%   11.7%   0.0%     0.6%     8.0%   77.2%
           cpu00    8.5%    0.0%    0.3%   0.0%     0.0%     0.0%   91.0%
           cpu01    0.0%    0.0%    0.0%   0.0%     0.0%     2.1%   97.8%
           cpu02    9.1%    0.0%    3.5%   0.0%     5.1%     0.0%   82.0%
           cpu03    0.7%    0.0%    6.5%   0.0%     0.0%    39.2%   53.3%
           cpu04    0.0%    0.0%    0.0%   0.0%     0.0%     0.0%  100.0%
           cpu05    0.0%    0.0%    4.9%   0.0%     0.0%     0.0%   95.0%
           cpu06    0.1%    0.0%    0.7%   0.0%     0.0%     0.1%   98.8%
           cpu07    0.0%    0.0%   77.2%   0.0%     0.0%    22.7%    0.0%
Mem:  8205784k av, 8190420k used, 15364k free, 0k shrd, 15552k buff
      6095448k actv, 1168128k in_d, 132980k in_c
Swap: 2040244k av, 0k used, 2040244k free
      7555784k cached

  PID USER    PRI  NI  SIZE  RSS  SHARE STAT %CPU %MEM  TIME CPU COMMAND
   10 root     25   0     0    0      0 RW    9.6  0.0  4:09   7 keventd
30223 oracle   16   0 21584  14M   9068 S     2.3  0.1  1:19   2 oracle
   19 root     18   0     0    0      0 SW    1.0  0.0  4:56   5 kswapd
 3497 root     16   0     0    0      0 SW    0.7  0.0  4:04   3 kjournald
    1 root     15   0   508  508    444 S     0.0  0.0  0:13   0 init
    2 root     RT   0     0    0      0 SW    0.0  0.0  0:00   0 migration/0

profile:
0000000000000000 total                    45258   0.0263
0000000002109100 default_idle             38813 485.1625
0000000002144c60 memcpy_from_kvec_dst      3540   5.8224
0000000002155020 launder_page               160   0.0524
00000000022a4220 __br_write_lock            136   2.1250
000000000214cc90 __iodesc_write_page        131   0.1300
00000000021685a0 get_unused_buffer_head     113   0.4708
0000000002148ce0 unlock_page                 87   0.7768
0000000002126a23 .text.lock.sched            82   0.1320
0000000002147bf0 __remove_inode_page         71   0.4931
0000000002156680 rebalance_laundry_zone      68   0.0264
00000000022a4260 __br_write_unlock           66   1.7838
00000000021548ab .text.lock.swap             66   0.5641
0000000002140780 follow_page                 66   0.0897
0000000002159440 rmqueue                     63   0.0729

Wed Aug 17 11:58:51 EDT 2005
procs                      memory      swap          io     system         cpu
 r  b  swpd   free   buff   cache   si  so    bi     bo    in    cs  us sy wa id
 2  3     0  19156 409088 7157528    0   0     1      9     5     5   0  0  0  4
 1  1     0  16444 409088 7160508    0   0     0  76010   882   523   4 15  6 75
 3  1     0  15508 409088 7161228    0   0     2  70134   837   489   3 15  6 76

Linux 2.4.21-34.ELhugemem (xeon2.lab.boston.redhat.com)   08/17/2005

avg-cpu:  %user  %nice   %sys  %iowait  %idle
           3.45   0.00  15.46     5.52  75.57

Device:     rrqm/s   wrqm/s   r/s     w/s   rsec/s    wsec/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await  svctm  %util
ida/c0d1      0.50     2.60  0.10    0.90     4.80     28.00    2.40    14.00    32.80     0.07  73.00  71.00   7.10
ida/c0d1p1    0.50     2.60  0.10    0.90     4.80     28.00    2.40    14.00    32.80     0.07  73.00  71.00   7.10
sdf           0.00 16401.20  0.00 1128.50     0.00 140240.00    0.00 70120.00   124.27   248.67 220.67   0.50  55.90
sdf1          0.00 16401.20  0.00 1128.50     0.00 140240.00    0.00 70120.00   124.27   248.67 220.67   0.50  55.90
Any update?
While the data does show a lot of CPU time taken up by keventd, it does not show a decrease in I/O throughput for the system. I would like to know precisely why this bugzilla has been filed. The summary mentions high I/O wait time during heavy I/O, and this is exactly what you get under heavy I/O; the O/S has to wait for I/Os to complete. Now, what precisely is meant by poor performance? From further comments in the bug, it sounds like the real concern is that AIO doesn't buy you much, if anything, under RHEL 3, when used with file systems. We have corroborated that data, and provided some advice on tuning. So, what is the problem being reported?
Hi, Fundamentally, we're seeing two behaviors: iowait is going up long before it should, and absolute IO performance is going down. From comment 13, your testing shows roughly a 50% decrease in throughput (compared to sync) and net performance that is 30-40% slower (which I would categorize as a problem). I can see the behavior with as little as 5-10MB/sec of throughput (to a 10-disk stripe with 4 200M channels to a dedicated array with 4G of cache). Basically it looks like the other 7 processors are waiting on keventd, which seems to write out less efficiently than it should and thus makes kernel aio slower than standard sync (which it shouldn't be). If you've provided any advice on tuning (other than suggesting we turn on dio, which is the slowest of all the tests that either of us have performed), it must be in a private comment. Happy to test any tuning suggestions you have once I can see them. tx Steve
Steve, OK, thanks for clarification. I'll work on a patch to remove some of the bottleneck from keventd. No timeframes yet on availability, but I'll let you know when I have something that works. Thanks, Jeff
Output from the new kernel: 2.4.21-34.ELaio.percpu.1hugemem #1 SMP Thu Aug 25 16:30:16 EDT

A few changes since the last runs: using 2 db writers and hugemem kernels for all runs (the prior report had sync IO using 1 db writer).

bash-2.05b$ grep real *3.txt
cr_jeffm_sync3.txt:real   25m11.812s = 1511.81 s
cr_jeffm_async3.txt:real  26m31.007s = 1591.01 s

Thus Jeff's patch has effectively moved the time spent in keventd out to the aio threads, which are no longer saturating a single cpu. Without faster IO underneath the lun, the overall performance of the tablespace create remains within 5.2% of the baseline result (1591s/1511s). (Please correct the previous runtime baseline for 2 dbwriters to 25m39.046s.)

In both cases, using more dbwriters and/or asynchronous multi-threaded IO does not improve the performance of a single-file database create. Other loads that write to multiple files and file systems, or that run on systems with greater IO capability, may benefit more from AIO.

Shak

Using Oracle aio_only:

92 processes: 89 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq   iowait    idle
           total    2.8%    0.0%   11.6%   0.0%     0.4%     8.9%   76.1%
           cpu00    0.7%    0.0%    2.5%   0.0%     0.0%     0.3%   96.2%
           cpu01    0.0%    0.0%    7.5%   0.0%     0.0%     4.5%   87.8%
           cpu02    0.0%    0.0%    7.7%   0.0%     1.9%    50.6%   39.5%
           cpu03    0.0%    0.0%    1.7%   0.0%     1.3%     0.0%   96.8%
           cpu04   17.7%    0.0%   19.9%   0.0%     0.0%     5.9%   56.3%
           cpu05    0.9%    0.0%   19.5%   0.0%     0.0%     4.1%   75.2%
           cpu06    2.9%    0.0%   23.3%   0.0%     0.0%     2.5%   71.1%
           cpu07    0.0%    0.0%   10.7%   0.0%     0.0%     3.3%   85.8%
Mem:  8205772k av, 946812k used, 7258960k free, 0k shrd, 12888k buff
      841928k active, 10856k inactive
Swap: 2040244k av, 0k used, 2040244k free
      800172k cached

  PID USER    PRI  NI  SIZE  RSS  SHARE STAT %CPU %MEM  TIME CPU COMMAND
   29 root     16   0     0    0      0 SW    2.8  0.0  0:01   6 aio/6
 1861 oracle   24   0 29904  22M  17396 R     2.8  0.2  0:02   4 oracle
   28 root     16   0     0    0      0 RW    2.4  0.0  0:02   5 aio/5
   27 root     16   0     0    0      0 SW    2.3  0.0  0:02   4 aio/4
   30 root     16   0     0    0      0 SW    1.2  0.0  0:00   7 aio/7
   24 root     16   0     0    0      0 SW    0.9  0.0  0:00   1 aio/1
 1791 root     15   0     0    0      0 SW    0.9  0.0  0:00   2 kjournald
   10 root     15   0     0    0      0 SW    0.3  0.0  0:00   3 keventd
    1 root     15   0   508  508    448 S     0.0  0.0  0:08   2 init
    2 root     RT   0     0    0      0 SW    0.0  0.0  0:00   0 migration/0

-bash-2.05b$ iostat -x 10 | grep sdf1
sdf1  0.03   767.93  1.04  27.10  14.33   6368.66   7.17  3184.33  226.83  12.06  425.79  1.39   3.91
sdf1  0.00 15211.20 14.70 544.00 116.80 126147.20  58.40 63073.60  226.00 220.51  393.81  1.03  57.50
sdf1  0.00 16784.50 13.50 701.50 108.80 139723.20  54.40 69861.60  195.57 134.84  198.43  1.03  74.00
sdf1  0.00 11036.60 15.60 597.60 124.00  93309.60  62.00 46654.80  152.37 165.26  263.86  0.66  40.60
sdf1  0.00 13632.20 13.80 705.40 110.40 114704.80  55.20 57352.40  159.64 259.45  360.74  0.90  64.40
sdf1  0.00 12817.10 14.00 771.70 112.00 108712.80  56.00 54356.40  138.51 213.49  272.43  0.67  52.90
sdf1  0.00 13701.60 14.30 925.20 114.40 117057.60  57.20 58528.80  124.72 227.56  243.78  0.58  54.10
sdf1  0.00 14114.90 14.70 892.40 117.60 120079.20  58.80 60039.60  132.51 229.90  251.84  0.61  54.90
sdf1  0.00 13628.97 13.79 743.66 110.29 114965.03  55.14 57482.52  151.93 219.74  289.27  0.71  53.85
sdf1  0.00 13229.60 14.30 670.60 114.40 111203.20  57.20 55601.60  162.53 213.42  311.70  0.78  53.1

With SYNC IO:

92 processes: 90 sleeping, 2 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq   iowait    idle
           total    1.6%    0.0%   10.7%   0.0%     5.0%     1.1%   81.2%
           cpu00    0.0%    0.0%    5.7%   0.0%     0.0%     0.0%   94.2%
           cpu01    0.9%    0.0%    1.9%   0.0%     0.0%     0.9%   96.1%
           cpu02    0.0%    0.0%    0.0%   0.0%     0.0%     0.0%  100.0%
           cpu03    0.0%    0.0%    0.0%   0.0%     0.0%     0.0%  100.0%
           cpu04   12.5%    0.0%   78.8%   0.0%     0.0%     8.6%    0.0%
           cpu05    0.0%    0.0%    0.0%   0.0%     0.0%     0.0%  100.0%
           cpu06    0.0%    0.0%    0.0%   0.0%    40.3%     0.0%   59.6%
           cpu07    0.0%    0.0%    0.0%   0.0%     0.0%     0.0%  100.0%
Mem:  8205772k av, 8190496k used, 15276k free, 0k shrd, 49944k buff
      258108k actv, 5866336k in_d, 133168k in_c
Swap: 2040244k av, 0k used, 2040244k free
      7791572k cached

  PID USER    PRI  NI  SIZE  RSS  SHARE STAT %CPU %MEM  TIME CPU COMMAND
 2052 oracle   25   0 29692  22M  17180 R    11.3  0.2  5:05   4 oracle
   19 root     16   0     0    0      0 SW    0.7  0.0  2:06   0 kswapd
 2085 oracle   20   0  1124 1124    892 R     0.2  0.0  0:00   1 top
    1 root     15   0   304  304    244 S     0.0  0.0  0:08   1 init
    2 root     RT   0     0    0      0 SW    0.0  0.0  0:00   0 migration/0
Thanks for this, looks very promising. Can I get a link to download an SMP version of this kernel (we don't use hugemem when other 3rd party software is involved)? We'll stick it in the more 'real world' workloads and let you know how it does when aio is unleashed. Related question: I assume both of our dio tests were so bad (compared to sync) because the tablespace create does lots of re-reads (haven't traced the process so don't know this to be fact)? It's a logical assumption, but I wondered if any of your tracing showed that behavior? Lastly, we're fairly late in the beta cycle here, but is the intent to roll this into u6? tx Steve
I'll kick off an SMP build, but please note that this change has NOT been checked into CVS and is NOT a supported build. I'll attach a copy of the patch to this bugzilla, too, so there is some record of it. Note that the patch, in its current state, is not acceptable for merging. There are some known issues with UP builds that will have to be addressed. In order to put a patch like this into the next update we would need a substantial amount of testing under varied workloads to ensure it does not introduce regressions. Do you have a baseline to compare this kernel to? How are you going to quantify the performance impact?
Created attachment 118304 [details] Preliminary per-cpu aio thread implementation Here is the patch that was used during the db create runs that shak benchmarked.
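For context, a rough userspace sketch of the pattern the patch description implies (one worker thread per CPU, each draining its own queue, with work queued to the worker for the CPU where it was produced). This is illustrative only and is NOT the attached kernel patch; all names here are made up. Build with: gcc -pthread percpu_sketch.c

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct work {
    struct work *next;
    int payload;                  /* stand-in for an AIO completion to process */
};

struct worker {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t cond;
    struct work *head;
};

static struct worker *workers;
static int nr_cpus;

static void *worker_fn(void *arg)
{
    struct worker *w = arg;
    for (;;) {
        pthread_mutex_lock(&w->lock);
        while (!w->head)
            pthread_cond_wait(&w->cond, &w->lock);
        struct work *wk = w->head;
        w->head = wk->next;
        pthread_mutex_unlock(&w->lock);
        printf("processed %d\n", wk->payload);   /* "process" the completion */
        free(wk);
    }
    return NULL;
}

/* Queue to the worker for the CPU this thread is currently running on. */
static void queue_work(int payload)
{
    int cpu = sched_getcpu() % nr_cpus;
    struct worker *w = &workers[cpu];
    struct work *wk = malloc(sizeof(*wk));
    wk->payload = payload;
    pthread_mutex_lock(&w->lock);
    wk->next = w->head;
    w->head = wk;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

int main(void)
{
    nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    workers = calloc(nr_cpus, sizeof(*workers));
    for (int i = 0; i < nr_cpus; i++) {
        pthread_mutex_init(&workers[i].lock, NULL);
        pthread_cond_init(&workers[i].cond, NULL);
        pthread_create(&workers[i].thread, NULL, worker_fn, &workers[i]);
    }
    for (int i = 0; i < 100; i++)
        queue_work(i);
    sleep(1);                      /* let the workers drain before exiting */
    return 0;
}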
You can find the smp kernel rpm for i686 at: http://people.redhat.com/.bz165687/kernel-smp-2.4.21-34.ELaio.percpu.1.i686.rpm Please let us know how your testing goes. Thanks.
Steve, in answer to the last question in comment #19, U6 is already closed.
(In reply to comment #22) > You can find the smp kernel rpm for i686 at: > > http://people.redhat.com/.bz165687/kernel-smp-2.4.21-34.ELaio.percpu.1.i686.rpm Of course, I meant: http://people.redhat.com/jmoyer/.bz165687/kernel-smp-2.4.21-34.ELaio.percpu.1.i686.rpm
Hi, We'll download it today, stick it together with powerpath and potentially get some results this afternoon/early next week. The tests that we'll use are:

- parallel insert test: sql load of 2, 4, 8, 16, 32 clients doing 10,000 4k inserts each (though as this gates mostly on lgwr, don't expect a huge improvement here)
- hammerora (models tpc-c)
- potentially parallel index rebuilds on some big tables, which should stress the dbwr processes without gating on lgwr (which always does sync io)

tx Steve
Looking at this a bit more, the directio stuff looks a bit dodgy too. dio is slower than sync io, and both sets show almost no read activity (and thus presumably no re-read activity). Any thoughts on why? Certainly seems to diverge from Shak's testing of rhel4 (which is obviously different in more ways than it is useful to count). tx Steve
Sorry for the delay on this. I was out for vacation and the guys didn't have enough free time to test in my absence. Will try to get something out this week. Any thoughts on the dio vs sync discrepancy? Tx Steve
Hi, Here's some of the long awaited results. Summary, better, but still odd. Looks like in our test (unlike yours) the aio threads are added, but only 2 of them get dispatched to (thus shrinking aio time by half, but still leaving it much higher than everything else). Did you integrate your kernel with PowerPath? We suspect that's making both the original situation and the testing of your fix less than optimal. here's the data (sorry for the long post), 10GB tablespace rather than 100 to save time: Summary ------- Neither Direct IO or Async Elapsed: 00:05:28.39 Direct IO Only Elapsed: 00:06:30.73 Async IO only Elapsed: 00:20:01.98 Regards Jeremy -------------------------------------------------------------------------------------------- Direct IO Only Elapsed: 00:06:30.73 xldn1336por(parkerjb)::/$ iostat -x 30|grep emcpowerm1 emcpowerm1 1.34 1828.08 0.01 0.00 61.12 16300.61 30.56 8150.31 1700766.53 1.79 0.25 103271.67 99.35 emcpowerm1 0.00 1.77 0.00 0.00 0.53 54.93 0.27 27.47 0.00 831.94 0.00 0.00 100.00 emcpowerm1 0.00 3.10 0.00 0.00 0.00 117.07 0.00 58.53 0.00 929.08 0.00 0.00 100.00 emcpowerm1 0.00 1.20 0.00 0.00 0.00 26.40 0.00 13.20 0.00 375.04 0.00 0.00 100.00 emcpowerm1 0.00 1.23 0.00 0.00 1785.13 127.33 892.57 63.67 0.00 336.05 0.00 0.00 100.00 emcpowerm1 0.00 3055.27 0.00 0.00 468.00 45179.60 234.00 22589.80 0.00 5.16 0.00 0.00 100.00 emcpowerm1 0.00 3937.70 0.00 0.00 24.80 63556.20 12.40 31778.10 0.00 78.67 0.00 0.00 100.00 emcpowerm1 0.00 1577.30 0.00 0.00 4.27 33154.80 2.13 16577.40 0.00 18.88 0.00 0.00 100.00 emcpowerm1 0.00 1268.37 0.00 0.00 21.33 27344.17 10.67 13672.08 0.00 408.74 0.00 0.00 100.00 emcpowerm1 0.00 632.17 0.00 0.00 5.87 13291.37 2.93 6645.68 0.00 1129.82 0.00 0.00 100.00 emcpowerm1 0.00 3134.77 0.00 0.00 17.07 29010.20 8.53 14505.10 0.00 1037.75 0.00 0.00 100.00 emcpowerm1 0.00 2713.47 0.00 0.00 3.73 22705.67 1.87 11352.83 0.00 74.73 0.00 0.00 100.00 emcpowerm1 0.00 3118.13 0.00 0.00 20.27 26067.93 10.13 13033.97 0.00 476.41 0.00 0.00 100.00 emcpowerm1 0.00 2960.83 0.00 0.00 4.27 24739.27 2.13 12369.63 0.00 605.07 0.00 0.00 100.00 emcpowerm1 0.00 10782.07 0.00 0.00 57.33 89999.27 28.67 44999.63 0.00 400.00 0.00 0.00 100.00 emcpowerm1 0.00 12510.30 0.00 0.00 248.80 104398.47 124.40 52199.23 0.00 204.41 0.00 0.00 100.00 emcpowerm1 0.00 12513.27 0.00 0.00 8.00 104446.47 4.00 52223.23 0.00 617.52 0.00 0.00 100.00 emcpowerm1 0.00 12538.67 0.00 0.00 19.20 104641.80 9.60 52320.90 0.00 1272.01 0.00 0.00 100.00 emcpowerm1 0.00 1866.74 0.00 0.00 72.78 15659.11 36.39 7829.56 0.00 1055.58 0.00 0.00 99.97 emcpowerm1 0.00 1.40 0.00 0.00 16.00 74.93 8.00 37.47 0.00 1051.46 0.00 0.00 100.00 emcpowerm1 0.00 1.40 0.00 0.00 4.27 49.73 2.13 24.87 0.00 206.68 0.00 0.00 100.00 emcpowerm1 0.00 1.40 0.00 0.00 16.53 51.93 8.27 25.97 0.00 14.35 0.00 0.00 100.00 emcpowerm1 0.00 1.40 0.00 0.00 4.27 49.67 2.13 24.83 0.00 146.39 0.00 0.00 100.00 emcpowerm1 0.00 1.40 0.00 0.00 16.00 49.73 8.00 24.87 0.00 1100.26 0.00 0.00 100.00 -------------------------------------------------------------------------------------------- Using async IO only Elapsed: 00:20:01.98 xldn1336por(parkerjb)::/$ iostat -x 30|grep emcpowerm1 emcpowerm1 1.45 1913.43 0.01 0.00 66.11 16588.80 33.05 8294.40 1573317.49 0.81 0.25 93789.69 99.28 emcpowerm1 13.14 5422.64 0.00 0.00 422.01 71277.09 211.00 35638.55 0.00 310.72 0.00 0.00 100.03 emcpowerm1 0.03 416.93 0.00 0.00 5.33 18970.67 2.67 9485.33 0.00 128.49 0.00 0.00 100.00 emcpowerm1 0.00 179.14 0.00 0.00 1.60 6596.47 0.80 3298.23 0.00 156.17 0.00 0.00 99.97 emcpowerm1 
0.00 87.67 0.00 0.00 0.80 5608.27 0.40 2804.13 0.00 622.60 0.00 0.00 100.00 emcpowerm1 0.00 88.73 0.00 0.00 0.80 5514.13 0.40 2757.07 0.00 160.67 0.00 0.00 100.00 emcpowerm1 0.00 86.93 0.00 0.00 0.80 5068.80 0.40 2534.40 0.00 77.42 0.00 0.00 100.00 emcpowerm1 0.00 98.20 0.00 0.00 0.53 5133.33 0.27 2566.67 0.00 814.05 0.00 0.00 100.00 emcpowerm1 0.00 125.53 0.00 0.00 0.80 5850.93 0.40 2925.47 0.00 358.20 0.00 0.00 100.00 emcpowerm1 0.00 147.23 0.00 0.00 0.80 5952.53 0.40 2976.27 0.00 123.99 0.00 0.00 100.00 emcpowerm1 0.00 194.37 0.00 0.00 1.87 7236.00 0.93 3618.00 0.00 242.19 0.00 0.00 100.00 emcpowerm1 0.03 174.33 0.00 0.00 4.53 6754.93 2.27 3377.47 0.00 167.82 0.00 0.00 100.00 emcpowerm1 0.00 147.87 0.00 0.00 1.33 6210.40 0.67 3105.20 0.00 373.88 0.00 0.00 100.00 emcpowerm1 0.00 138.23 0.00 0.00 0.53 6503.73 0.27 3251.87 0.00 401.56 0.00 0.00 100.00 emcpowerm1 0.00 171.87 0.00 0.00 0.80 7093.60 0.40 3546.80 0.00 145.92 0.00 0.00 100.00 emcpowerm1 0.00 239.27 0.00 0.00 0.53 6769.07 0.27 3384.53 0.00 661.91 0.00 0.00 100.00 emcpowerm1 0.00 637.53 0.00 0.00 2.40 10381.33 1.20 5190.67 0.00 405.76 0.00 0.00 100.00 emcpowerm1 0.00 1239.13 0.00 0.00 2.67 12441.07 1.33 6220.53 0.00 954.73 0.00 0.00 100.00 emcpowerm1 0.00 85.23 0.00 0.00 1.33 6705.87 0.67 3352.93 0.00 414.92 0.00 0.00 100.00 emcpowerm1 0.00 97.87 0.00 0.00 1.33 7205.60 0.67 3602.80 0.00 172.06 0.00 0.00 100.00 emcpowerm1 0.00 400.37 0.00 0.00 11.20 11632.92 5.60 5816.46 0.00 838.73 0.00 0.00 99.97 emcpowerm1 0.17 849.63 0.00 0.00 54.13 20847.73 27.07 10423.87 0.00 349.42 0.00 0.00 100.00 emcpowerm1 0.00 371.30 0.00 0.00 9.87 7137.87 4.93 3568.93 0.00 410.53 0.00 0.00 100.00 emcpowerm1 0.00 316.50 0.00 0.00 8.27 7054.13 4.13 3527.07 0.00 62.71 0.00 0.00 100.00 emcpowerm1 0.00 273.70 0.00 0.00 7.47 7039.47 3.73 3519.73 0.00 211.97 0.00 0.00 100.00 emcpowerm1 0.10 322.97 0.00 0.00 15.73 7799.47 7.87 3899.73 0.00 672.85 0.00 0.00 100.00 emcpowerm1 0.00 577.83 0.00 0.00 7.73 9243.47 3.87 4621.73 0.00 29.68 0.00 0.00 100.00 emcpowerm1 0.00 1051.40 0.00 0.00 8.00 11900.27 4.00 5950.13 0.00 31.37 0.00 0.00 100.00 emcpowerm1 0.00 1157.30 0.00 0.00 2.40 14036.80 1.20 7018.40 0.00 818.47 0.00 0.00 100.00 emcpowerm1 0.00 894.33 0.00 0.00 2.67 12834.40 1.33 6417.20 0.00 160.16 0.00 0.00 100.00 emcpowerm1 0.00 801.63 0.00 0.00 0.53 11448.80 0.27 5724.40 0.00 157.34 0.00 0.00 100.00 emcpowerm1 0.00 738.40 0.00 0.00 1.07 11672.53 0.53 5836.27 0.00 460.79 0.00 0.00 100.00 emcpowerm1 0.00 1159.67 0.00 0.00 0.53 18200.80 0.27 9100.40 0.00 140.92 0.00 0.00 100.00 emcpowerm1 0.00 2565.10 0.00 0.00 0.00 43179.47 0.00 21589.73 0.00 133.86 0.00 0.00 100.00 emcpowerm1 0.00 1500.93 0.00 0.00 0.00 33383.47 0.00 16691.73 0.00 422.59 0.00 0.00 100.00 emcpowerm1 0.00 1196.77 0.00 0.00 0.00 26195.47 0.00 13097.73 0.00 243.73 0.00 0.00 100.00 emcpowerm1 0.00 654.92 0.00 0.00 0.27 13652.52 0.13 6826.26 0.00 591.95 0.00 0.00 99.97 emcpowerm1 0.00 3958.87 0.00 0.00 0.00 34173.07 0.00 17086.53 0.00 504.45 0.00 0.00 100.00 emcpowerm1 0.00 2770.70 0.00 0.00 0.00 22968.53 0.00 11484.27 0.00 353.74 0.00 0.00 100.00 emcpowerm1 0.00 3341.63 0.00 0.00 0.00 27723.73 0.00 13861.87 0.00 674.83 0.00 0.00 100.00 emcpowerm1 0.00 3463.19 0.00 0.00 0.00 28706.73 0.00 14353.36 0.00 562.95 0.00 0.00 100.03 emcpowerm1 0.00 11889.89 0.00 0.00 0.00 98695.93 0.00 49347.97 0.00 611.48 0.00 0.00 99.97 emcpowerm1 0.00 1.77 0.00 0.00 0.27 57.87 0.13 28.93 0.00 216.47 0.00 0.00 100.00 emcpowerm1 0.00 1.73 0.00 0.00 0.00 53.07 0.00 26.53 0.00 58.16 0.00 0.00 100.00 Three sample top outputs: 
14:13:54 up 4:33, 5 users, load average: 3.42, 3.54, 2.68 219 processes: 218 sleeping, 1 running, 0 zombie, 0 stopped CPU states: cpu user nice system irq softirq iowait idle total 0.9% 0.0% 2.1% 0.0% 0.0% 43.9% 52.8% cpu00 0.3% 0.0% 0.7% 0.0% 0.0% 98.8% 0.0% cpu01 0.1% 0.0% 1.1% 0.0% 0.0% 96.4% 2.1% cpu02 0.1% 0.0% 13.7% 0.0% 0.1% 85.8% 0.0% cpu03 0.1% 0.0% 0.1% 0.0% 0.0% 70.4% 29.1% cpu04 6.3% 0.0% 0.5% 0.0% 0.1% 0.0% 92.8% cpu05 0.1% 0.0% 0.1% 0.0% 0.1% 0.0% 99.4% cpu06 0.1% 0.0% 0.0% 0.0% 0.0% 0.0% 99.8% cpu07 0.0% 0.0% 0.3% 0.0% 0.0% 0.0% 99.6% Mem: 16423520k av, 13852980k used, 2570540k free, 0k shrd, 38520k buff 8128736k actv, 333468k in_d, 359212k in_c Swap: 8388600k av, 0k used, 8388600k free 9047436k cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 25 root 15 0 0 0 0 DW 1.5 0.0 1:31 2 aio/2 24929 oracle 15 0 1155M 1.1G 1143M S 0.8 7.1 0:39 4 oracle 21 root 15 0 0 0 0 DW 0.1 0.0 1:42 1 bdflush 1 root 15 0 524 524 460 S 0.0 0.0 0:06 6 init 2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0 3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1 4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2 5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3 6 root RT 0 0 0 0 SW 0.0 0.0 0:00 4 migration/4 7 root RT 0 0 0 0 SW 0.0 0.0 0:00 5 migration/5 8 root RT 0 0 0 0 SW 0.0 0.0 0:00 6 migration/6 9 root RT 0 0 0 0 SW 0.0 0.0 0:00 7 migration/7 10 root 15 0 0 0 0 SW 0.0 0.0 0:00 5 keventd 11 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0 12 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1 13 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2 14 root 39 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3 15 root 34 19 0 0 0 SWN 0.0 0.0 0:00 4 ksoftirqd/4 16 root 39 19 0 0 0 SWN 0.0 0.0 0:00 5 ksoftirqd/5 17 root 39 19 0 0 0 SWN 0.0 0.0 0:00 6 ksoftirqd/6 18 root 39 19 0 0 0 SWN 0.0 0.0 0:00 7 ksoftirqd/7 19 root 15 0 0 0 0 SW 0.0 0.0 2:22 7 kswapd 20 root 15 0 0 0 0 SW 0.0 0.0 0:01 7 kscand 22 root 15 0 0 0 0 SW 0.0 0.0 0:15 5 kupdated 23 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 aio/0 24 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 aio/1 26 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 aio/3 27 root 15 0 0 0 0 SW 0.0 0.0 0:00 4 aio/4 28 root 15 0 0 0 0 SW 0.0 0.0 0:00 5 aio/5 29 root 15 0 0 0 0 SW 0.0 0.0 0:00 6 aio/6 30 root 15 0 0 0 0 SW 0.0 0.0 0:00 7 aio/7 31 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd 14:07:28 up 4:27, 5 users, load average: 3.64, 3.14, 2.13 219 processes: 218 sleeping, 1 running, 0 zombie, 0 stopped CPU states: cpu user nice system irq softirq iowait idle total 0.1% 0.0% 0.7% 0.0% 0.1% 40.9% 57.8% cpu00 0.5% 0.0% 0.0% 0.0% 0.0% 0.0% 99.4% cpu01 0.0% 0.0% 0.0% 0.0% 0.0% 38.2% 61.7% cpu02 0.0% 0.0% 3.1% 0.0% 0.1% 96.6% 0.0% cpu03 0.1% 0.0% 1.3% 0.0% 0.1% 90.2% 7.9% cpu04 0.5% 0.1% 0.9% 0.0% 0.0% 0.0% 98.2% cpu05 0.1% 0.3% 0.0% 0.0% 0.5% 0.0% 98.8% cpu06 0.0% 0.0% 0.1% 0.0% 0.1% 3.1% 96.4% cpu07 0.0% 0.0% 0.5% 0.0% 0.0% 99.4% 0.0% Mem: 16423520k av, 10143028k used, 6280492k free, 0k shrd, 32344k buff 4423732k actv, 333508k in_d, 356492k in_c Swap: 8388600k av, 0k used, 8388600k free 5346464k cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 25 root 15 0 0 0 0 DW 0.3 0.0 0:47 2 aio/2 1 root 19 0 524 524 460 S 0.0 0.0 0:06 3 init 2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0 3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1 4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2 5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3 6 root RT 0 0 0 0 SW 0.0 0.0 0:00 4 migration/4 7 root RT 0 0 0 0 SW 0.0 0.0 0:00 5 migration/5 8 root RT 0 0 0 0 SW 0.0 0.0 0:00 6 migration/6 9 root RT 0 0 0 0 SW 0.0 0.0 0:00 7 
migration/7 10 root 15 0 0 0 0 SW 0.0 0.0 0:00 5 keventd 11 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0 12 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1 13 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2 14 root 39 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3 15 root 34 19 0 0 0 SWN 0.0 0.0 0:00 4 ksoftirqd/4 16 root 39 19 0 0 0 SWN 0.0 0.0 0:00 5 ksoftirqd/5 17 root 39 19 0 0 0 SWN 0.0 0.0 0:00 6 ksoftirqd/6 18 root 39 19 0 0 0 SWN 0.0 0.0 0:00 7 ksoftirqd/7 21 root 15 0 0 0 0 DW 0.0 0.0 1:36 3 bdflush 19 root 15 0 0 0 0 SW 0.0 0.0 2:22 5 kswapd 20 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 kscand 22 root 15 0 0 0 0 SW 0.0 0.0 0:15 6 kupdated 23 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 aio/0 24 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 aio/1 26 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 aio/3 27 root 15 0 0 0 0 SW 0.0 0.0 0:00 4 aio/4 28 root 15 0 0 0 0 SW 0.0 0.0 0:00 5 aio/5 29 root 15 0 0 0 0 SW 0.0 0.0 0:00 6 aio/6 30 root 15 0 0 0 0 SW 0.0 0.0 0:00 7 aio/7 14:05:37 up 4:25, 5 users, load average: 3.02, 2.92, 1.95 220 processes: 219 sleeping, 1 running, 0 zombie, 0 stopped CPU states: cpu user nice system irq softirq iowait idle total 0.3% 0.0% 2.1% 0.0% 0.7% 33.2% 63.4% cpu00 2.1% 0.0% 0.5% 0.0% 0.0% 0.0% 97.2% cpu01 0.3% 0.3% 3.1% 0.0% 0.0% 68.9% 27.0% cpu02 0.1% 0.0% 10.7% 0.0% 1.3% 87.6% 0.0% cpu03 0.0% 0.0% 0.1% 0.0% 0.7% 0.0% 99.0% cpu04 0.1% 0.0% 0.1% 0.0% 1.3% 0.0% 98.2% cpu05 0.0% 0.0% 0.1% 0.0% 2.3% 24.3% 73.1% cpu06 0.0% 0.0% 1.9% 0.0% 0.0% 84.0% 13.9% cpu07 0.0% 0.0% 0.1% 0.0% 0.0% 0.7% 99.0% Mem: 16423520k av, 9631480k used, 6792040k free, 0k shrd, 30528k buff 3903120k actv, 333056k in_d, 365384k in_c Swap: 8388600k av, 0k used, 8388600k free 4836372k cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 25 root 15 0 0 0 0 DW 1.3 0.0 0:41 2 aio/2 21 root 15 0 0 0 0 SW 0.3 0.0 1:35 1 bdflush 24929 oracle 15 0 1155M 1.1G 1143M S 0.3 7.1 0:16 0 oracle 32492 root 15 0 0 0 0 DW 0.2 0.0 0:51 6 kjournald 1 root 15 0 524 524 460 S 0.0 0.0 0:06 3 init 2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0 3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1 4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2 5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3 6 root RT 0 0 0 0 SW 0.0 0.0 0:00 4 migration/4 7 root RT 0 0 0 0 SW 0.0 0.0 0:00 5 migration/5 8 root RT 0 0 0 0 SW 0.0 0.0 0:00 6 migration/6 9 root RT 0 0 0 0 SW 0.0 0.0 0:00 7 migration/7 10 root 15 0 0 0 0 SW 0.0 0.0 0:00 5 keventd 11 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0 12 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1 13 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2 14 root 39 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3 15 root 34 19 0 0 0 SWN 0.0 0.0 0:00 4 ksoftirqd/4 16 root 39 19 0 0 0 SWN 0.0 0.0 0:00 5 ksoftirqd/5 17 root 39 19 0 0 0 SWN 0.0 0.0 0:00 6 ksoftirqd/6 18 root 39 19 0 0 0 SWN 0.0 0.0 0:00 7 ksoftirqd/7 19 root 15 0 0 0 0 SW 0.0 0.0 2:22 5 kswapd 20 root 15 0 0 0 0 SW 0.0 0.0 0:01 7 kscand 22 root 15 0 0 0 0 SW 0.0 0.0 0:15 6 kupdated 23 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 aio/0 24 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 aio/1 26 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 aio/3 27 root 15 0 0 0 0 SW 0.0 0.0 0:00 4 aio/4 28 root 15 0 0 0 0 SW 0.0 0.0 0:00 5 aio/5 29 root 15 0 0 0 0 SW 0.0 0.0 0:00 6 aio/6 30 root 16 0 0 0 0 SW 0.0 0.0 0:00 7 aio/7 31 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd -------------------------------------------------------------------------------------------- Neither Direct IO or Async Elapsed: 00:05:28.39 xldn1336por(parkerjb)::/$ iostat -x 30|grep emcpowerm1 emcpowerm1 1.44 1870.89 0.01 0.00 64.16 16072.76 32.08 
8036.38 1445633.40 1.45 0.25 88909.57 99.25 emcpowerm1 0.00 1.73 0.00 0.00 0.00 53.62 0.00 26.81 0.00 456.93 0.00 0.00 100.03 emcpowerm1 0.00 1.73 0.00 0.00 0.00 53.33 0.00 26.67 0.00 1121.24 0.00 0.00 100.00 emcpowerm1 0.00 1.77 0.00 0.00 0.00 53.07 0.00 26.53 0.00 162.69 0.00 0.00 100.00 emcpowerm1 0.00 5981.03 0.00 0.00 0.00 91080.00 0.00 45540.00 0.00 535.18 0.00 0.00 100.00 emcpowerm1 0.00 1850.80 0.00 0.00 0.00 36775.47 0.00 18387.73 0.00 172.98 0.00 0.00 100.00 emcpowerm1 0.00 1374.57 0.00 0.00 0.00 30032.00 0.00 15016.00 0.00 513.38 0.00 0.00 100.00 emcpowerm1 0.00 725.53 0.00 0.00 0.00 16170.93 0.00 8085.47 0.00 410.84 0.00 0.00 100.00 emcpowerm1 0.00 2715.60 0.00 0.00 0.00 27600.80 0.00 13800.40 0.00 450.69 0.00 0.00 100.00 emcpowerm1 0.00 3088.17 0.00 0.00 0.00 25614.13 0.00 12807.07 0.00 1032.48 0.00 0.00 100.00 emcpowerm1 0.00 2933.53 0.00 0.00 0.00 24372.00 0.00 12186.00 0.00 501.49 0.00 0.00 100.00 emcpowerm1 0.00 3226.83 0.00 0.00 0.00 26796.53 0.00 13398.27 0.00 83.28 0.00 0.00 100.00 emcpowerm1 0.00 12181.53 0.00 0.00 0.53 100927.20 0.27 50463.60 0.00 444.34 0.00 0.00 100.00 emcpowerm1 0.00 18399.07 0.00 0.00 0.00 152347.35 0.00 76173.68 0.00 16.72 0.00 0.00 99.97 emcpowerm1 0.00 18037.50 0.00 0.00 4.27 149355.73 2.13 74677.87 0.00 849.80 0.00 0.00 100.00 emcpowerm1 0.00 413.70 0.00 0.00 2.13 3550.67 1.07 1775.33 0.00 761.65 0.00 0.00 100.00 -------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------
We did not have EMC storage, so we did not/could not use powerpath in our testing. Are we basically running at the speed of the disk/IO subsystem? If so, multithreading the I/O using AIO will only add overhead and prevent any improvement that AIO might otherwise be able to sustain. Our experience with our partners has been that AIO only has a substantial payoff on large IO subsystems.
Interestingly, the LUN shows as 100% loaded either way (from the above iostat), but the throughput rate in the sync/dio case is a multiple of the aio case. Overall we're not pushing this too hard: it's a dedicated array with 3G of write cache and 10 spinning bits of copper for the LUN, with 2 200MB/sec channels to the array. In the sync IO case there are bursts in the 150MB/sec range, whereas the aio case struggles to get past 30-40MB/sec and is generally at 10-12MB/sec. tx Steve
How many threads does oracle use to submit I/O? The patch I posted will queue processing to a thread on the same CPU on which the I/O was scheduled. Thus, if the app is only doing AIO from two threads, and they are always scheduled on the same CPUs, then that would explain the behaviour you are seeing. Actually, looking at your top output above, it appears that only one aio thread is getting any CPU time. Hmm, do you really have 16GB of memory? If so, we should probably be testing with -hugemem. The smp kernel performs quite poorly with 16GB. We'll continue with the smp kernel, so as not to invalidate all of the numbers in this bz. However, please note that you should deploy hugemem in production. Also, could you get the output from sysrq-m while the tests are running? This will help to see if we are running into memory pressure issues. So, I'm putting together another kernel that will schedule I/O round-robin between the aio threads. I'll post a pointer to the kernel when it is ready. Thanks.
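To illustrate the difference in dispatch policy (again only a userspace sketch, not the actual patch), a round-robin variant would replace the queue_work() function from the earlier per-CPU sketch with something like the following; it reuses the hypothetical struct worker, workers[] and nr_cpus definitions from that sketch:

/* Round-robin variant of queue_work(): instead of targeting the submitter's
 * CPU, rotate across all workers, so I/O submitted from only one or two CPUs
 * still fans out over every aio thread. */
static void queue_work_rr(int payload)
{
    static int next;              /* illustrative; a kernel would use an atomic or per-ctx counter */
    int target = __sync_fetch_and_add(&next, 1) % nr_cpus;
    struct worker *w = &workers[target];
    struct work *wk = malloc(sizeof(*wk));
    wk->payload = payload;
    pthread_mutex_lock(&w->lock);
    wk->next = w->head;
    w->head = wk;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}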
As a correction to comment #29, the current testbed we are using to test the AIO kernels is an 8-cpu Xeon with 8GB of memory, using a qlogic QLA2300 and an EMC CLARiiON ... but we have not been using powerpath (yet) as we have only been using 1 FC adapter. Sorry for any confusion, as I am out of town on a business trip this week.
I'm changing this back to ASSIGNED as it's really waiting on me.
Issue was closed, and does not meet the RHEL 3.9 release criteria.