Description of problem:
The iozone command line is: /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb. Usually the test takes no more than 1 hour. The test order is: initial writers -> rewriters -> readers -> re-readers -> reverse readers -> stride readers -> random readers -> mixed workload -> random writers -> pwrite writers -> pread readers.

I have tried 3 times on kvm-83-144.el5 using a rhel5u4 i386 guest + virtio_blk. The test has not finished after 16 hours. top shows the CPU usage of the iozone process is zero, the child processes of iozone stay in Disk Sleep state, and iostat shows iowait is 98%.

The problem cannot be reproduced on kvm-83-140.el5. On kvm-83-144.el5, so far only the RHEL guest + virtio_blk combination seems to be affected:
Windows guest + ide/virtio_blk => PASS
RHEL guest + ide => PASS

The problem was found while doing performance regression testing for Bug#552250.

top - 23:59:37 up 12:39, 4 users, load average: 8.09, 8.03, 7.93
Tasks: 136 total, 1 running, 135 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 100.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2075448k total, 1989812k used, 85636k free, 19352k buffers
Swap: 2097144k total, 56k used, 2097088k free, 1784708k cached

  PID USER  PR NI  VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
 2679 root  15  0 40984 21m 12m S  0.0  1.1 0:00.49 /usr/bin/sealer
 2650 root  15  0 38356 20m 10m S  0.0  1.0 0:00.52 puplet
 2814 root  16  0 40668 16m 660 S  0.0  0.8 0:00.30 iozone
 2832 root  17  0 40668 16m 160 D  0.0  0.8 0:00.59 iozone
 2833 root  18  0 40668 16m 160 D  0.0  0.8 0:00.58 iozone

[root@localhost ~]# iostat
Linux 2.6.18-164.11.1.el5 (localhost.localdomain)   01/14/2010

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.02    0.38   98.40    0.00    1.17

Device:   tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda      0.64        20.72         6.02     945928     274644
hda1     0.00         0.06         0.00       2633         20
hda2     0.63        20.64         6.01     942170     274624
dm-0     1.56        20.61         6.01     940794     274512
dm-1     0.00         0.02         0.00        896        112
vda      1.92        15.66       389.97     714947   17805126
vda1     1.91        15.63       389.97     713525   17805118

Version-Release number of selected component (if applicable):
kvm-83-144.el5

How reproducible:
100%

Steps to Reproduce:
1. Create a raw format image file: qemu-img create -f raw test.raw 30G
2. Boot a rhel5u4 guest with test.raw; the command line I used is:
/usr/libexec/qemu-kvm -m 2048 -smp 2 -no-kvm-pit-reinjection -rtc-td-hack -drive file=/date/t77/RHEL-Server-5.4-32-virtio.qcow2 -net nic,macaddr=00:21:01:02:5F:01 -net tap -monitor stdio -vnc :2 -drive file=test.raw,if=virtio -uuid 6dc1c9ab-5b22-4960-a6aa-b7becff05ee5
3. In the guest, create a partition on /dev/vda
4. Make a filesystem on /dev/vda1
5. mount /dev/vda1 /mnt
6. Run iozone: /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb

Actual results:

Expected results:

Additional info:
Just to make sure: the test works fine with kvm-83-140.el5 but breaks with kvm-83-144.el5, with everything else left the same? And the host is a generic RHEL5.4/5 system?
Between those releases there are very few changes touching the block code. The one with the biggest line-count impact is Juan's savevm version support, but the only one that actually looks relevant to I/O is Vadim's queue notify support:

commit bad54220ef2133fac3f1978f40d0e82f5b5dfc6c
Author: Vadim Rozenfeld <vrozenfe>
Date:   Sun Jan 3 21:55:56 2010 -0200

    Queue notify support for virtio block device.
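For readers not familiar with that commit, the general idea of queue notify support is to suppress further guest notifications while block requests are outstanding and only re-enable them once the pending count drains to zero. Below is a minimal sketch of that pattern, assuming qemu's virtio headers are available; the calls virtio_queue_set_notification()/virtio_blk_handle_output() and the vdev/vq/pending fields match the patch fragment quoted further down in this bug, but the struct name, the function names and the split into a kick handler and a completion hook are illustrative assumptions, not the actual kvm-83 code:

/* Illustrative sketch only -- not the actual patch. */
typedef struct VirtIOBlockSketch {
    VirtIODevice vdev;   /* embedded virtio device, as in qemu's VirtIOBlock */
    VirtQueue *vq;       /* the block request queue */
    int pending;         /* requests submitted to the host but not yet completed */
} VirtIOBlockSketch;

/* Guest kicked the queue: stop further notifications while requests are in flight. */
static void sketch_handle_kick(VirtIOBlockSketch *s)
{
    virtio_queue_set_notification(s->vq, 0);    /* suppress further guest kicks */
    s->pending++;
    virtio_blk_handle_output(&s->vdev, s->vq);  /* pop and submit the queued requests */
}

/* Host-side completion: once nothing is pending, allow the guest to kick again. */
static void sketch_complete(VirtIOBlockSketch *s)
{
    if (--s->pending == 0)
        virtio_queue_set_notification(s->vq, 1);
}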
Yes, it is. Actually, adding notify support was the reason we asked the QA team to run some disk I/O stress tests on Windows and Linux guests. IMO, the simplest way to fix this bug without changing the Linux virtio block driver is to revert the above commit. Dor, what do you think?
Isn't there a race there that we can fix? It should be easy to debug, since there is probably an unacked request. Until you do that, let's revert it from 5.5 to reduce problems.
Agree, let's revert it. I have an experimental version which utilizes an idle bottom half for queue notification (suggested by Anthony), but it still needs more time for tweaking and testing with Windows and Linux guests.
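One possible reading of "idle bottom half for queue notification", sketched purely as an illustration and not as the experimental patch mentioned above: defer re-enabling notification to a qemu bottom half scheduled at idle time. This assumes the qemu_bh_new()/qemu_bh_schedule_idle() API is present in the kvm-83 tree, reuses the illustrative VirtIOBlockSketch fields from the sketch above, and the notify_bh field is hypothetical:

/* Hypothetical sketch, NOT the experimental patch. */
static void sketch_notify_bh(void *opaque)
{
    VirtIOBlockSketch *s = opaque;
    virtio_queue_set_notification(s->vq, 1);   /* re-enable guest kicks from the bottom half */
}

/* during device init (notify_bh is an assumed extra QEMUBH* field):
 *     s->notify_bh = qemu_bh_new(sketch_notify_bh, s);
 * on completion, instead of re-enabling notification inline:
 *     if (--s->pending == 0)
 *         qemu_bh_schedule_idle(s->notify_bh);
 */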
Christoph recommended switching the order of the notification code that revealed this race:

> +        virtio_blk_handle_output(&s->vdev, s->vq);
> +        if(--s->pending == 0)
> +                virtio_queue_set_notification(s->vq, 1);

Here is a brew build with it:
https://brewweb.devel.redhat.com/taskinfo?taskID=2223209

Can you please retest with the above?
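For context, here is a hedged sketch of the completion path those three lines appear to belong to; only the three quoted statements come from the scratch build, while the callback name, its exact placement and the VirtIOBlockSketch structure (same as in the earlier sketch) are assumptions. With this order the queue is drained before guest notifications are re-enabled:

/* Sketch of the reordered completion path being tested in the brew build above. */
static void sketch_request_done(VirtIOBlockSketch *s)
{
    virtio_blk_handle_output(&s->vdev, s->vq);    /* quoted: process anything queued first */
    if (--s->pending == 0)                        /* quoted: was this the last in-flight request? */
        virtio_queue_set_notification(s->vq, 1);  /* quoted: only then re-enable guest kicks */
}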
(In reply to comment #7)
> Christoph recommended switching the order of the notification code that
> revealed this race:
>
> > +        virtio_blk_handle_output(&s->vdev, s->vq);
> > +        if(--s->pending == 0)
> > +                virtio_queue_set_notification(s->vq, 1);
>
> Here is a brew build with it:
> https://brewweb.devel.redhat.com/taskinfo?taskID=2223209
>
> Can you please retest with the above?

I just finished one test run using the command line /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb (record size is 4 KB by default). The iozone process finished, but it took nearly three and a half hours; I remember that on kvm-83-140.el5 it took less than 1 hour. So it is still unsatisfactory, and dmesg shows iozone being blocked for more than 120 seconds.

I am now retesting with "-t 4" on the guest. Should I go on to do the performance regression testing on both RHEL and Linux guests with this scratch build?

[root@test opt]# /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb
    Iozone: Performance Test of File I/O
            Version $Revision: 3.327 $
            Compiled for 32 bit mode.
            Build: linux

    Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
                  Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
                  Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                  Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                  Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
                  Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

    Run began: Mon Jan 25 06:53:35 2010

    File size set to 2097152 KB
    CPU utilization Resolution = 0.000 seconds.
    CPU utilization Excel chart enabled
    Command line used: /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
    Throughput test with 2 processes
    Each process writes a 2097152 Kbyte file in 4 Kbyte records

    Children see throughput for 2 initial writers = 34107.53 KB/sec
    Parent sees throughput for 2 initial writers = 25302.25 KB/sec
    Min throughput per process = 14972.12 KB/sec
    Max throughput per process = 19135.41 KB/sec
    Avg throughput per process = 17053.76 KB/sec
    Min xfer = 1648256.00 KB
    CPU Utilization: Wall time 124.019  CPU time 35.575  CPU utilization 28.68 %

    Children see throughput for 2 rewriters = 31133.55 KB/sec
    Parent sees throughput for 2 rewriters = 26434.81 KB/sec
    Min throughput per process = 15453.63 KB/sec
    Max throughput per process = 15679.92 KB/sec
    Avg throughput per process = 15566.77 KB/sec
    Min xfer = 2070288.00 KB
    CPU utilization: Wall time 133.968  CPU time 12.329  CPU utilization 9.20 %

    Children see throughput for 2 readers = 39591.90 KB/sec
    Parent sees throughput for 2 readers = 39583.81 KB/sec
    Min throughput per process = 18862.49 KB/sec
    Max throughput per process = 20729.40 KB/sec
    Avg throughput per process = 19795.95 KB/sec
    Min xfer = 1908464.00 KB
    CPU utilization: Wall time 101.178  CPU time 8.301  CPU utilization 8.20 %

    Children see throughput for 2 re-readers = 35913.27 KB/sec
    Parent sees throughput for 2 re-readers = 35908.36 KB/sec
    Min throughput per process = 17872.24 KB/sec
    Max throughput per process = 18041.03 KB/sec
    Avg throughput per process = 17956.64 KB/sec
    Min xfer = 2077680.00 KB
    CPU utilization: Wall time 116.252  CPU time 7.961  CPU utilization 6.85 %

    Children see throughput for 2 reverse readers = 801.94 KB/sec
    Parent sees throughput for 2 reverse readers = 801.94 KB/sec
    Min throughput per process = 400.16 KB/sec
    Max throughput per process = 401.78 KB/sec
    Avg throughput per process = 400.97 KB/sec
    Min xfer = 2088732.00 KB
    CPU utilization: Wall time 5219.679  CPU time 5.167  CPU utilization 0.10 %

    Children see throughput for 2 stride readers = 2575.00 KB/sec
    Parent sees throughput for 2 stride readers = 2574.99 KB/sec
    Min throughput per process = 1270.22 KB/sec
    Max throughput per process = 1304.78 KB/sec
    Avg throughput per process = 1287.50 KB/sec
    Min xfer = 2041616.00 KB
    CPU utilization: Wall time 1607.290  CPU time 72.550  CPU utilization 4.51 %

    Children see throughput for 2 random readers = 20446.81 KB/sec
    Parent sees throughput for 2 random readers = 20446.07 KB/sec
    Min throughput per process = 10144.86 KB/sec
    Max throughput per process = 10301.96 KB/sec
    Avg throughput per process = 10223.41 KB/sec
    Min xfer = 2065156.00 KB
    CPU utilization: Wall time 203.568  CPU time 96.264  CPU utilization 47.29 %

    Children see throughput for 2 mixed workload = 2073.67 KB/sec
    Parent sees throughput for 2 mixed workload = 1517.79 KB/sec
    Min throughput per process = 104.92 KB/sec
    Max throughput per process = 1968.75 KB/sec
    Avg throughput per process = 1036.84 KB/sec
    Min xfer = 111764.00 KB
    CPU utilization: Wall time 1065.228  CPU time 8.909  CPU utilization 0.84 %

    Children see throughput for 2 random writers = 2258.73 KB/sec
    Parent sees throughput for 2 random writers = 1707.53 KB/sec
    Min throughput per process = 1092.29 KB/sec
    Max throughput per process = 1166.44 KB/sec
    Avg throughput per process = 1129.36 KB/sec
    Min xfer = 1971792.00 KB
    CPU utilization: Wall time 1886.754  CPU time 17.807  CPU utilization 0.94 %

    Children see throughput for 2 pwrite writers = 37613.20 KB/sec
    Parent sees throughput for 2 pwrite writers = 30085.88 KB/sec
    Min throughput per process = 18497.13 KB/sec
    Max throughput per process = 19116.07 KB/sec
    Avg throughput per process = 18806.60 KB/sec
    Min xfer = 2031840.00 KB
    CPU utilization: Wall time 113.643  CPU time 31.501  CPU utilization 27.72 %

    Children see throughput for 2 pread readers = 32881.63 KB/sec
    Parent sees throughput for 2 pread readers = 32846.55 KB/sec
    Min throughput per process = 16302.49 KB/sec
    Max throughput per process = 16579.14 KB/sec
    Avg throughput per process = 16440.81 KB/sec
    Min xfer = 2062320.00 KB
    CPU utilization: Wall time 126.503  CPU time 7.002  CPU utilization 5.53 %

INFO: task iozone:32482 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 000007F4  2512 32482   2689 32483               (NOTLB)
       da9c5ed0 00000086 4f32cbf6 000007f4 000007f4 0000000e 00000000 00000007
       f5fc2000 4f32d09a 000007f4 000004a4 00000001 f5fc210c ca0132c4 f78d3900
       f5a99468 00000000 00000001 da9c5ecc c041eb94 00000000 da9c5ed8 ffffffff
Call Trace:
 [<c041eb94>] __wake_up+0x2a/0x3d
 [<f88511da>] log_wait_commit+0x80/0xc7 [jbd]
 [<c0435fd7>] autoremove_wake_function+0x0/0x2d
 [<f884c661>] journal_stop+0x195/0x1ba [jbd]
 [<c0494436>] __writeback_single_inode+0x1a3/0x2af
 [<c045db8a>] do_writepages+0x2b/0x32
 [<c04596f7>] __filemap_fdatawrite_range+0x66/0x72
 [<c0494ad2>] sync_inode+0x19/0x24
 [<f8882007>] ext3_sync_file+0xaf/0xc4 [ext3]
 [<c0477ff7>] do_fsync+0x41/0x83
 [<c0478056>] __do_fsync+0x1d/0x2b
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
INFO: task iozone:32482 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 000007F4  2512 32482   2689 32483               (NOTLB)
       da9c5ed0 00000086 4f32cbf6 000007f4 000007f4 0000000e 00000000 00000007
       f5fc2000 4f32d09a 000007f4 000004a4 00000001 f5fc210c ca0132c4 f78d3900
       f5a99468 00000000 00000001 da9c5ecc c041eb94 00000000 da9c5ed8 ffffffff
Call Trace:
 [<c041eb94>] __wake_up+0x2a/0x3d
 [<f88511da>] log_wait_commit+0x80/0xc7 [jbd]
 [<c0435fd7>] autoremove_wake_function+0x0/0x2d
 [<f884c661>] journal_stop+0x195/0x1ba [jbd]
 [<c0494436>] __writeback_single_inode+0x1a3/0x2af
 [<c045db8a>] do_writepages+0x2b/0x32
 [<c04596f7>] __filemap_fdatawrite_range+0x66/0x72
 [<c0494ad2>] sync_inode+0x19/0x24
 [<f8882007>] ext3_sync_file+0xaf/0xc4 [ext3]
 [<c0477ff7>] do_fsync+0x41/0x83
 [<c0478056>] __do_fsync+0x1d/0x2b
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
INFO: task iozone:32482 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 000007F4  2512 32482   2689 32483               (NOTLB)
       da9c5ed0 00000086 4f32cbf6 000007f4 000007f4 0000000e 00000000 00000007
       f5fc2000 4f32d09a 000007f4 000004a4 00000001 f5fc210c ca0132c4 f78d3900
       f5a99468 00000000 00000001 da9c5ecc c041eb94 00000000 da9c5ed8 ffffffff
Call Trace:
 [<c041eb94>] __wake_up+0x2a/0x3d
 [<f88511da>] log_wait_commit+0x80/0xc7 [jbd]
 [<c0435fd7>] autoremove_wake_function+0x0/0x2d
 [<f884c661>] journal_stop+0x195/0x1ba [jbd]
 [<c0494436>] __writeback_single_inode+0x1a3/0x2af
 [<c045db8a>] do_writepages+0x2b/0x32
 [<c04596f7>] __filemap_fdatawrite_range+0x66/0x72
 [<c0494ad2>] sync_inode+0x19/0x24
 [<f8882007>] ext3_sync_file+0xaf/0xc4 [ext3]
 [<c0477ff7>] do_fsync+0x41/0x83
 [<c0478056>] __do_fsync+0x1d/0x2b
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
[root@test opt]#

[root@test ~]# sar
Linux 2.6.18-183.el5 (test.54.32)   01/25/2010

07:40:01 AM   CPU   %user   %nice   %system   %iowait   %steal   %idle
07:50:01 AM   all    0.02    0.00      0.13     49.96     0.00   49.89
08:00:01 AM   all    0.01    0.60      1.22     49.33     0.00   48.85
08:10:01 AM   all    0.01   15.19     25.14     31.07     0.00   28.59
08:20:01 AM   all    0.01    3.83      6.53     44.84     0.00   44.79
08:30:01 AM   all    0.03    0.00      0.12     49.58     0.00   50.27
08:40:01 AM   all    0.08    0.00      4.48     45.76     0.00   49.68
08:50:01 AM   all    0.06    0.00      3.70     46.51     0.00   49.72
09:00:18 AM   all    0.17    0.18     13.82     37.81     0.00   48.02
09:10:01 AM   all    0.04    0.00      1.67     98.27     0.00    0.02
09:20:01 AM   all    0.04    0.00      1.68     97.51     0.00    0.78
09:30:01 AM   all    0.07    0.00      1.93     96.89     0.00    1.11
09:40:01 AM   all    0.05    0.00      2.00     97.71     0.00    0.23
09:50:01 AM   all    0.04    0.00      2.06     96.90     0.00    1.00
10:00:01 AM   all    0.04    0.03      1.65     98.02     0.00    0.27
10:10:02 AM   all    0.10    0.16     10.50     63.61     0.00   25.62
Average:      all    0.05    1.33      5.13     66.80     0.00   26.68
[root@test ~]#
Christoph/Vadim, any reason why it was blocking for that long? Is there still a race? lihuang, what's the physical disk array parallelism (spindle count) that you have? We might need to add a parameter for testing.
(In reply to comment #9)
> Christoph/Vadim, any reason why it was blocking for that long?
> Is there still a race?
>
> lihuang, what's the physical disk array parallelism (spindle count) that you
> have? We might need to add a parameter for testing.

The host is a Dell OPTIPLEX 760 with only one physical disk.

And when I use "-t 4", the test has not finished in 12 hours (the current test is random readers).
In the end, by serializing the I/O with this scheme we do not just serialize disk parallelism but also any I/O parallelism in the host, be that buffered I/O if we use it, or the host I/O scheduler being able to merge the I/Os more efficiently. And from looking at the bug, it seems to use the qemu defaults, thus using the page cache in the host.
Created attachment 386792 [details]
log of -t4

(In reply to comment #10)
> (In reply to comment #9)
> > Christoph/Vadim, any reason why it was blocking for that long?
> > Is there still a race?
> >
> > lihuang, what's the physical disk array parallelism (spindle count) that you
> > have? We might need to add a parameter for testing.
>
> The host is a Dell OPTIPLEX 760 with only one physical disk.
>
> And when I use "-t 4", the test has not finished in 12 hours (the current
> test is random readers).

I have decided to stop the test now; it had only just started the "random writers" test after 20 hours. Please let me know if any other testing is needed.

Thanks
Lijun Huang
(In reply to comment #11)
> In the end, by serializing the I/O with this scheme we do not just serialize
> disk parallelism but also any I/O parallelism in the host, be that buffered
> I/O if we use it, or the host I/O scheduler being able to merge the I/Os more
> efficiently. And from looking at the bug, it seems to use the qemu defaults,
> thus using the page cache in the host.

Christoph is right, QE did not use cache=off in the above command. Please *always* do that.
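Concretely, for the command line in the description this only means adding the cache option to the data disk's -drive argument, e.g. (everything else unchanged):

-drive file=test.raw,if=virtio,cache=off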
We reverted the performance enhancements in kvm-83-152.el5, so the bug will go through MODIFIED -> ON_QA.
(In reply to comment #14)
> We reverted the performance enhancements in kvm-83-152.el5, so the bug will
> go through MODIFIED -> ON_QA.

Yes. The problem is gone in kvm > 152. (What is the correct status for this bug: VERIFIED or NOTABUG?)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0271.html