Description of problem:
The iozone command line is: /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb. Usually the test takes no more than 1 hour. The test order is: initial writers -> rewriters -> readers -> re-readers -> reverse readers -> stride readers -> random readers -> mixed workload -> random writers -> pwrite writers -> pread readers.

I have tried 3 times on kvm-83-144.el5 using a rhel5u4 i386 guest + virtio_blk. The test has not finished after 16 hours. top shows the CPU usage of the iozone process is zero, the child processes of iozone stay in Disk Sleep state, and iostat shows iowait is 98%.

The problem cannot be reproduced on kvm-83-140.el5. On kvm-83-144.el5, so far only the RHEL guest + virtio_blk combination seems to be affected:
Windows guest + ide/virtio_blk => PASS
RHEL guest + ide => PASS

The problem was found while doing performance regression testing for Bug#552250.

top - 23:59:37 up 12:39, 4 users, load average: 8.09, 8.03, 7.93
Tasks: 136 total, 1 running, 135 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 100.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2075448k total, 1989812k used, 85636k free, 19352k buffers
Swap: 2097144k total, 56k used, 2097088k free, 1784708k cached

  PID USER  PR NI  VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
 2679 root  15  0 40984 21m 12m S  0.0  1.1 0:00.49 /usr/bin/sealer
 2650 root  15  0 38356 20m 10m S  0.0  1.0 0:00.52 puplet
 2814 root  16  0 40668 16m 660 S  0.0  0.8 0:00.30 iozone
 2832 root  17  0 40668 16m 160 D  0.0  0.8 0:00.59 iozone
 2833 root  18  0 40668 16m 160 D  0.0  0.8 0:00.58 iozone

[root@localhost ~]# iostat
Linux 2.6.18-164.11.1.el5 (localhost.localdomain)   01/14/2010

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.02    0.38   98.40    0.00    1.17

Device:   tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda      0.64        20.72         6.02     945928     274644
hda1     0.00         0.06         0.00       2633         20
hda2     0.63        20.64         6.01     942170     274624
dm-0     1.56        20.61         6.01     940794     274512
dm-1     0.00         0.02         0.00        896        112
vda      1.92        15.66       389.97     714947   17805126
vda1     1.91        15.63       389.97     713525   17805118

Version-Release number of selected component (if applicable):
kvm-83-144.el5

How reproducible:
100%

Steps to Reproduce:
1. Create a raw format image file: qemu-img create -f raw test.raw 30G
2. Boot a rhel5u4 guest with test.raw; the command line I used is:
/usr/libexec/qemu-kvm -m 2048 -smp 2 -no-kvm-pit-reinjection -rtc-td-hack -drive file=/date/t77/RHEL-Server-5.4-32-virtio.qcow2 -net nic,macaddr=00:21:01:02:5F:01 -net tap -monitor stdio -vnc :2 -drive file=test.raw,if=virtio -uuid 6dc1c9ab-5b22-4960-a6aa-b7becff05ee5
3. In the guest, create a partition on /dev/vda
4. Make a filesystem on /dev/vda1
5. mount /dev/vda1 /mnt
6. Run iozone: /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb

Actual results:

Expected results:

Additional info:
Just to make sure: the test works fine with kvm-83-140.el5 but breaks with kvm-83-144.el5, with everything else left the same? And the host is a generic RHEL5.4/5 system?
Between those releases there are very few changes touching the block code. The one with the biggest line-count impact is Juan's savevm version support, but the only one that actually looks relevant to I/O is Vadim's queue notify support:

commit bad54220ef2133fac3f1978f40d0e82f5b5dfc6c
Author: Vadim Rozenfeld <vrozenfe>
Date:   Sun Jan 3 21:55:56 2010 -0200

    Queue notify support for virtio block device.
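For readers not familiar with that commit, the general idea of queue notify support is to suppress further guest notifications while block requests are outstanding and only re-enable them once the pending count drains to zero. Below is a minimal sketch of that pattern, assuming qemu's virtio headers are available; the calls virtio_queue_set_notification()/virtio_blk_handle_output() and the vdev/vq/pending fields match the patch fragment quoted further down in this bug, but the struct name, the function names and the split into a kick handler and a completion hook are illustrative assumptions, not the actual kvm-83 code:

/* Illustrative sketch only -- not the actual patch. */
typedef struct VirtIOBlockSketch {
    VirtIODevice vdev;   /* embedded virtio device, as in qemu's VirtIOBlock */
    VirtQueue *vq;       /* the block request queue */
    int pending;         /* requests submitted to the host but not yet completed */
} VirtIOBlockSketch;

/* Guest kicked the queue: stop further notifications while requests are in flight. */
static void sketch_handle_kick(VirtIOBlockSketch *s)
{
    virtio_queue_set_notification(s->vq, 0);    /* suppress further guest kicks */
    s->pending++;
    virtio_blk_handle_output(&s->vdev, s->vq);  /* pop and submit the queued requests */
}

/* Host-side completion: once nothing is pending, allow the guest to kick again. */
static void sketch_complete(VirtIOBlockSketch *s)
{
    if (--s->pending == 0)
        virtio_queue_set_notification(s->vq, 1);
}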
Yes, it is. Actually, adding notify support was the reason we asked the QA team to run some disk I/O stress tests on Windows and Linux guests. IMO, the simplest way to fix this bug without changing the Linux virtio block driver is to revert the above commit. Dor, what do you think?
Isn't there a race there that we can fix? It should be easy to debug, since there is probably an unacked request. Until you do that, let's revert it from 5.5 to reduce problems.
Agree, let's revert it. I have an experimental version which utilizes an idle bottom half for queue notification (suggested by Anthony), but it still needs more time for tweaking and testing with Windows and Linux guests.
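One possible reading of "idle bottom half for queue notification", sketched purely as an illustration and not as the experimental patch mentioned above: defer re-enabling notification to a qemu bottom half scheduled at idle time. This assumes the qemu_bh_new()/qemu_bh_schedule_idle() API is present in the kvm-83 tree, reuses the illustrative VirtIOBlockSketch fields from the sketch above, and the notify_bh field is hypothetical:

/* Hypothetical sketch, NOT the experimental patch. */
static void sketch_notify_bh(void *opaque)
{
    VirtIOBlockSketch *s = opaque;
    virtio_queue_set_notification(s->vq, 1);   /* re-enable guest kicks from the bottom half */
}

/* during device init (notify_bh is an assumed extra QEMUBH* field):
 *     s->notify_bh = qemu_bh_new(sketch_notify_bh, s);
 * on completion, instead of re-enabling notification inline:
 *     if (--s->pending == 0)
 *         qemu_bh_schedule_idle(s->notify_bh);
 */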
Christoph recommended switching the order of the notification code that revealed this race:

> +        virtio_blk_handle_output(&s->vdev, s->vq);
> +        if(--s->pending == 0)
> +                virtio_queue_set_notification(s->vq, 1);

Here is a brew build with it:
https://brewweb.devel.redhat.com/taskinfo?taskID=2223209

Can you please retest with the above?
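For context, here is a hedged sketch of the completion path those three lines appear to belong to; only the three quoted statements come from the scratch build, while the callback name, its exact placement and the VirtIOBlockSketch structure (same as in the earlier sketch) are assumptions. With this order the queue is drained before guest notifications are re-enabled:

/* Sketch of the reordered completion path being tested in the brew build above. */
static void sketch_request_done(VirtIOBlockSketch *s)
{
    virtio_blk_handle_output(&s->vdev, s->vq);    /* quoted: process anything queued first */
    if (--s->pending == 0)                        /* quoted: was this the last in-flight request? */
        virtio_queue_set_notification(s->vq, 1);  /* quoted: only then re-enable guest kicks */
}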
(In reply to comment #7)
> Christoph recommended switching the order of the notification code that
> revealed this race:
>
> > +        virtio_blk_handle_output(&s->vdev, s->vq);
> > +        if(--s->pending == 0)
> > +                virtio_queue_set_notification(s->vq, 1);
>
> Here is a brew build with it:
> https://brewweb.devel.redhat.com/taskinfo?taskID=2223209
>
> Can you please retest with the above?

I just finished one test run using the command line /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb (record size is 4 KB by default). The iozone process finished, but it took nearly three and a half hours; I remember that on kvm-83-140.el5 it took less than 1 hour. So it is still unsatisfactory, and dmesg shows iozone being blocked for more than 120 seconds.

I am now retesting with "-t 4" on the guest. Should I go on to do the performance regression testing on both RHEL and Linux guests with this scratch build?

[root@test opt]# /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb
    Iozone: Performance Test of File I/O
            Version $Revision: 3.327 $
            Compiled for 32 bit mode.
            Build: linux

    Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
                  Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
                  Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                  Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                  Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
                  Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

    Run began: Mon Jan 25 06:53:35 2010

    File size set to 2097152 KB
    CPU utilization Resolution = 0.000 seconds.
    CPU utilization Excel chart enabled
    Command line used: /opt/iozone/bin/iozone -s 2g -+u -t 2 -F /mnt/fa /mnt/fb
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
    Throughput test with 2 processes
    Each process writes a 2097152 Kbyte file in 4 Kbyte records

    Children see throughput for 2 initial writers = 34107.53 KB/sec
    Parent sees throughput for 2 initial writers = 25302.25 KB/sec
    Min throughput per process = 14972.12 KB/sec
    Max throughput per process = 19135.41 KB/sec
    Avg throughput per process = 17053.76 KB/sec
    Min xfer = 1648256.00 KB
    CPU Utilization: Wall time 124.019  CPU time 35.575  CPU utilization 28.68 %

    Children see throughput for 2 rewriters = 31133.55 KB/sec
    Parent sees throughput for 2 rewriters = 26434.81 KB/sec
    Min throughput per process = 15453.63 KB/sec
    Max throughput per process = 15679.92 KB/sec
    Avg throughput per process = 15566.77 KB/sec
    Min xfer = 2070288.00 KB
    CPU utilization: Wall time 133.968  CPU time 12.329  CPU utilization 9.20 %

    Children see throughput for 2 readers = 39591.90 KB/sec
    Parent sees throughput for 2 readers = 39583.81 KB/sec
    Min throughput per process = 18862.49 KB/sec
    Max throughput per process = 20729.40 KB/sec
    Avg throughput per process = 19795.95 KB/sec
    Min xfer = 1908464.00 KB
    CPU utilization: Wall time 101.178  CPU time 8.301  CPU utilization 8.20 %

    Children see throughput for 2 re-readers = 35913.27 KB/sec
    Parent sees throughput for 2 re-readers = 35908.36 KB/sec
    Min throughput per process = 17872.24 KB/sec
    Max throughput per process = 18041.03 KB/sec
    Avg throughput per process = 17956.64 KB/sec
    Min xfer = 2077680.00 KB
    CPU utilization: Wall time 116.252  CPU time 7.961  CPU utilization 6.85 %

    Children see throughput for 2 reverse readers = 801.94 KB/sec
    Parent sees throughput for 2 reverse readers = 801.94 KB/sec
    Min throughput per process = 400.16 KB/sec
    Max throughput per process = 401.78 KB/sec
    Avg throughput per process = 400.97 KB/sec
    Min xfer = 2088732.00 KB
    CPU utilization: Wall time 5219.679  CPU time 5.167  CPU utilization 0.10 %

    Children see throughput for 2 stride readers = 2575.00 KB/sec
    Parent sees throughput for 2 stride readers = 2574.99 KB/sec
    Min throughput per process = 1270.22 KB/sec
    Max throughput per process = 1304.78 KB/sec
    Avg throughput per process = 1287.50 KB/sec
    Min xfer = 2041616.00 KB
    CPU utilization: Wall time 1607.290  CPU time 72.550  CPU utilization 4.51 %

    Children see throughput for 2 random readers = 20446.81 KB/sec
    Parent sees throughput for 2 random readers = 20446.07 KB/sec
    Min throughput per process = 10144.86 KB/sec
    Max throughput per process = 10301.96 KB/sec
    Avg throughput per process = 10223.41 KB/sec
    Min xfer = 2065156.00 KB
    CPU utilization: Wall time 203.568  CPU time 96.264  CPU utilization 47.29 %

    Children see throughput for 2 mixed workload = 2073.67 KB/sec
    Parent sees throughput for 2 mixed workload = 1517.79 KB/sec
    Min throughput per process = 104.92 KB/sec
    Max throughput per process = 1968.75 KB/sec
    Avg throughput per process = 1036.84 KB/sec
    Min xfer = 111764.00 KB
    CPU utilization: Wall time 1065.228  CPU time 8.909  CPU utilization 0.84 %

    Children see throughput for 2 random writers = 2258.73 KB/sec
    Parent sees throughput for 2 random writers = 1707.53 KB/sec
    Min throughput per process = 1092.29 KB/sec
    Max throughput per process = 1166.44 KB/sec
    Avg throughput per process = 1129.36 KB/sec
    Min xfer = 1971792.00 KB
    CPU utilization: Wall time 1886.754  CPU time 17.807  CPU utilization 0.94 %

    Children see throughput for 2 pwrite writers = 37613.20 KB/sec
    Parent sees throughput for 2 pwrite writers = 30085.88 KB/sec
    Min throughput per process = 18497.13 KB/sec
    Max throughput per process = 19116.07 KB/sec
    Avg throughput per process = 18806.60 KB/sec
    Min xfer = 2031840.00 KB
    CPU utilization: Wall time 113.643  CPU time 31.501  CPU utilization 27.72 %

    Children see throughput for 2 pread readers = 32881.63 KB/sec
    Parent sees throughput for 2 pread readers = 32846.55 KB/sec
    Min throughput per process = 16302.49 KB/sec
    Max throughput per process = 16579.14 KB/sec
    Avg throughput per process = 16440.81 KB/sec
    Min xfer = 2062320.00 KB
    CPU utilization: Wall time 126.503  CPU time 7.002  CPU utilization 5.53 %

INFO: task iozone:32482 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 000007F4  2512 32482   2689 32483               (NOTLB)
       da9c5ed0 00000086 4f32cbf6 000007f4 000007f4 0000000e 00000000 00000007
       f5fc2000 4f32d09a 000007f4 000004a4 00000001 f5fc210c ca0132c4 f78d3900
       f5a99468 00000000 00000001 da9c5ecc c041eb94 00000000 da9c5ed8 ffffffff
Call Trace:
 [<c041eb94>] __wake_up+0x2a/0x3d
 [<f88511da>] log_wait_commit+0x80/0xc7 [jbd]
 [<c0435fd7>] autoremove_wake_function+0x0/0x2d
 [<f884c661>] journal_stop+0x195/0x1ba [jbd]
 [<c0494436>] __writeback_single_inode+0x1a3/0x2af
 [<c045db8a>] do_writepages+0x2b/0x32
 [<c04596f7>] __filemap_fdatawrite_range+0x66/0x72
 [<c0494ad2>] sync_inode+0x19/0x24
 [<f8882007>] ext3_sync_file+0xaf/0xc4 [ext3]
 [<c0477ff7>] do_fsync+0x41/0x83
 [<c0478056>] __do_fsync+0x1d/0x2b
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
INFO: task iozone:32482 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 000007F4  2512 32482   2689 32483               (NOTLB)
       da9c5ed0 00000086 4f32cbf6 000007f4 000007f4 0000000e 00000000 00000007
       f5fc2000 4f32d09a 000007f4 000004a4 00000001 f5fc210c ca0132c4 f78d3900
       f5a99468 00000000 00000001 da9c5ecc c041eb94 00000000 da9c5ed8 ffffffff
Call Trace:
 [<c041eb94>] __wake_up+0x2a/0x3d
 [<f88511da>] log_wait_commit+0x80/0xc7 [jbd]
 [<c0435fd7>] autoremove_wake_function+0x0/0x2d
 [<f884c661>] journal_stop+0x195/0x1ba [jbd]
 [<c0494436>] __writeback_single_inode+0x1a3/0x2af
 [<c045db8a>] do_writepages+0x2b/0x32
 [<c04596f7>] __filemap_fdatawrite_range+0x66/0x72
 [<c0494ad2>] sync_inode+0x19/0x24
 [<f8882007>] ext3_sync_file+0xaf/0xc4 [ext3]
 [<c0477ff7>] do_fsync+0x41/0x83
 [<c0478056>] __do_fsync+0x1d/0x2b
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
INFO: task iozone:32482 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 000007F4  2512 32482   2689 32483               (NOTLB)
       da9c5ed0 00000086 4f32cbf6 000007f4 000007f4 0000000e 00000000 00000007
       f5fc2000 4f32d09a 000007f4 000004a4 00000001 f5fc210c ca0132c4 f78d3900
       f5a99468 00000000 00000001 da9c5ecc c041eb94 00000000 da9c5ed8 ffffffff
Call Trace:
 [<c041eb94>] __wake_up+0x2a/0x3d
 [<f88511da>] log_wait_commit+0x80/0xc7 [jbd]
 [<c0435fd7>] autoremove_wake_function+0x0/0x2d
 [<f884c661>] journal_stop+0x195/0x1ba [jbd]
 [<c0494436>] __writeback_single_inode+0x1a3/0x2af
 [<c045db8a>] do_writepages+0x2b/0x32
 [<c04596f7>] __filemap_fdatawrite_range+0x66/0x72
 [<c0494ad2>] sync_inode+0x19/0x24
 [<f8882007>] ext3_sync_file+0xaf/0xc4 [ext3]
 [<c0477ff7>] do_fsync+0x41/0x83
 [<c0478056>] __do_fsync+0x1d/0x2b
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
[root@test opt]#

[root@test ~]# sar
Linux 2.6.18-183.el5 (test.54.32)   01/25/2010

07:40:01 AM   CPU   %user   %nice   %system   %iowait   %steal   %idle
07:50:01 AM   all    0.02    0.00      0.13     49.96     0.00   49.89
08:00:01 AM   all    0.01    0.60      1.22     49.33     0.00   48.85
08:10:01 AM   all    0.01   15.19     25.14     31.07     0.00   28.59
08:20:01 AM   all    0.01    3.83      6.53     44.84     0.00   44.79
08:30:01 AM   all    0.03    0.00      0.12     49.58     0.00   50.27
08:40:01 AM   all    0.08    0.00      4.48     45.76     0.00   49.68
08:50:01 AM   all    0.06    0.00      3.70     46.51     0.00   49.72
09:00:18 AM   all    0.17    0.18     13.82     37.81     0.00   48.02
09:10:01 AM   all    0.04    0.00      1.67     98.27     0.00    0.02
09:20:01 AM   all    0.04    0.00      1.68     97.51     0.00    0.78
09:30:01 AM   all    0.07    0.00      1.93     96.89     0.00    1.11
09:40:01 AM   all    0.05    0.00      2.00     97.71     0.00    0.23
09:50:01 AM   all    0.04    0.00      2.06     96.90     0.00    1.00
10:00:01 AM   all    0.04    0.03      1.65     98.02     0.00    0.27
10:10:02 AM   all    0.10    0.16     10.50     63.61     0.00   25.62
Average:      all    0.05    1.33      5.13     66.80     0.00   26.68
[root@test ~]#
Christoph/Vadim, any reason why it was blocking for that long? Is there still a race? lihuang, what's the physical disk array parallelism (spindle count) that you have? We might need to add a parameter for testing.
(In reply to comment #9)
> Christoph/Vadim, any reason why it was blocking for that long?
> Is there still a race?
>
> lihuang, what's the physical disk array parallelism (spindle count) that you
> have? We might need to add a parameter for testing.

The host is a Dell OPTIPLEX 760 with only one physical disk.

And when I use "-t 4", the test has not finished in 12 hours (the current test is random readers).
In the end, by serializing the I/O with this scheme we do not just serialize disk parallelism but also any I/O parallelism in the host, be that buffered I/O if we use it, or the host I/O scheduler being able to merge the I/Os more efficiently. And from looking at the bug, it seems to use the qemu defaults, thus using the page cache in the host.
Created attachment 386792 [details]
log of -t4

(In reply to comment #10)
> (In reply to comment #9)
> > Christoph/Vadim, any reason why it was blocking for that long?
> > Is there still a race?
> >
> > lihuang, what's the physical disk array parallelism (spindle count) that you
> > have? We might need to add a parameter for testing.
>
> The host is a Dell OPTIPLEX 760 with only one physical disk.
>
> And when I use "-t 4", the test has not finished in 12 hours (the current
> test is random readers).

I have decided to stop the test now; it had only just started the "random writers" test after 20 hours. Please let me know if any other testing is needed.

Thanks
Lijun Huang
(In reply to comment #11)
> In the end, by serializing the I/O with this scheme we do not just serialize
> disk parallelism but also any I/O parallelism in the host, be that buffered
> I/O if we use it, or the host I/O scheduler being able to merge the I/Os more
> efficiently. And from looking at the bug, it seems to use the qemu defaults,
> thus using the page cache in the host.

Christoph is right, QE did not use cache=off in the above command. Please *always* do that.
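Concretely, for the command line in the description this only means adding the cache option to the data disk's -drive argument, e.g. (everything else unchanged):

-drive file=test.raw,if=virtio,cache=off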
We reverted the performance enhancements in kvm-83-152.el5, so the bug will go through MODIFIED -> ON_QA.
(In reply to comment #14)
> We reverted the performance enhancements in kvm-83-152.el5, so the bug will
> go through MODIFIED -> ON_QA.

Yes. The problem is gone in kvm > 152. (What is the correct status for this bug: VERIFIED or NOTABUG?)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0271.html