Bug 1392927 - Time to change the default regionsize for mirror and raid
Summary: Time to change the default regionsize for mirror and raid
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Heinz Mauelshagen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1394039
 
Reported: 2016-11-08 13:51 UTC by Jonathan Earl Brassow
Modified: 2017-05-15 14:10 UTC
CC: 7 users

Fixed In Version: 2.02.170
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-15 14:10:06 UTC
Embargoed:
heinzm: lvm-technical-solution+
rule-engine: lvm-test-coverage?



Description Jonathan Earl Brassow 2016-11-08 13:51:38 UTC
Users are experiencing much lower performance from RAID1 LVs created with LVM than from MD RAID1.  Tests show this is due to the small regionsize chosen for RAID1 in LVM.

Users can choose a larger regionsize when creating their RAID LV, but they cannot change it later.

This bug is for changing the default regionsize.
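
Until the default changes, the regionsize can at least be set explicitly at creation time, or raised globally in lvm.conf. A rough sketch (VG/LV names and the 2 MiB value are placeholders, not a recommendation; the lvm.conf knob is assumed to be activation/raid_region_size):

# lvcreate --type raid1 -m1 -R2M -L100G -n lv_data vg_data
# lvs -o+regionsize vg_data/lv_data

and, for new RaidLVs, in /etc/lvm/lvm.conf (value in KiB):

activation {
    raid_region_size = 2048
}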

Comment 1 Jonathan Earl Brassow 2016-11-08 13:52:30 UTC
For example:
https://www.redhat.com/archives/linux-lvm/2016-November/msg00003.html

Comment 2 Heinz Mauelshagen 2016-11-17 12:00:29 UTC
Might be true for the related use case reported on linux-lvm:
"
I am experiencing a dramatic degradation of the sequential write speed
on a raid1 LV that resides on two USB-3 connected harddisks (UAS
enabled), compared to parallel access to both drives without raid or
compared to MD raid:

- parallel sequential writes LVs on both disks: 140 MB/s per disk
- sequential write to MD raid1 without bitmap: 140 MB/s
- sequential write to MD raid1 with bitmap: 48 MB/s
- sequential write to LVM raid1: 17 MB/s !!
"

Using SATA disks for PVs, this is not reproducible:

[root@o ~]# lvcreate --nosync -R512K -m1 -nr -y -L128G ssd_host 
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
  Logical volume "r" created.
[root@o ~]# dd of=/dev/ssd_host/r if=/dev/zero oflag=direct iflag=fullblock bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.48014 s, 240 MB/s
[root@o ~]# lvremove -y ssd_host/r
  Logical volume "r" successfully removed
[root@o ~]# lvcreate --nosync -R128M -m1 -nr -y -L128G ssd_host 
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
  Logical volume "r" created.
[root@o ~]# dd of=/dev/ssd_host/r if=/dev/zero oflag=direct iflag=fullblock bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.48841 s, 239 MB/s

So the user's USB stack plays a role, which is a niche use case for RAID.

Comment 3 Heinz Mauelshagen 2016-11-17 12:38:05 UTC
Set up a 2-PV VG on USB transport.
Getting worse performance with the larger raid1 regionsize on it
(sequential write, 512k regionsize: 109 MB/s; 64m regionsize: 83.8 MB/s).

It is thus not conclusive to assume that a larger regionsize is always better:

[root@o ~]# pvs|grep usb
  /dev/sdh       usb_vg   lvm2 a--  238.47g 238.47g
  /dev/sdi       usb_vg   lvm2 a--  238.47g 238.47g
[root@o ~]# vgs usb_vg
  VG     #PV #LV #SN Attr   VSize   VFree  
  usb_vg   2   0   0 wz--n- 476.95g 476.95g
[root@o ~]# lvcreate --nosync -R512k -m1 -nr -y -L1G usb_vg
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
  Logical volume "r" created.
[root@o ~]# lvs -ao+regionsize,devices usb_vg
  LV           VG     Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Region  Devices                    
  r            usb_vg Rwi-a-r--- 1.00g                                    100.00           512.00k r_rimage_0(0),r_rimage_1(0)
[root@o ~]# dd of=/dev/usb_vg/r bs=256M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 2.46946 s, 109 MB/s
[root@o ~]# lvremove -y usb_vg
  Logical volume "r" successfully removed
[root@o ~]# lvcreate --nosync -R64m -m1 -nr -y -L1G usb_vg
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
  Logical volume "r" created.
[root@o ~]# lvs -ao+regionsize,devices usb_vg
  LV           VG     Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Region Devices                    
  r            usb_vg Rwi-a-r--- 1.00g                                    100.00           64.00m r_rimage_0(0),r_rimage_1(0)
[root@o ~]# dd of=/dev/usb_vg/r bs=256M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 3.20161 s, 83.8 MB/s

Comment 4 Heinz Mauelshagen 2016-11-17 12:48:00 UTC
An optimized USB config, attaching the USB disks to separate controllers
to reduce the bandwidth variations seen in the test of comment #3, leads
to little throughput variation across regionsizes.

We need the exact commands used to get the user's results
documented in comment #2.

[root@o ~]# lvcreate --nosync -R512k -m1 -nr -y -L1G usb_vg
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
  Logical volume "r" created.
[root@o ~]# dd of=/dev/usb_vg/r bs=256M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.64452 s, 163 MB/s
[root@o ~]# lvremove -y usb_vg
  Logical volume "r" successfully removed
[root@o ~]# lvcreate --nosync -R64m -m1 -nr -y -L1G usb_vg
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
  Logical volume "r" created.
[root@o ~]# dd of=/dev/usb_vg/r bs=256M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.73097 s, 155 MB/s
[root@o ~]# dd of=/dev/usb_vg/r bs=256M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.66767 s, 161 MB/s
[root@o ~]# lvremove -y usb_vg
  Logical volume "r" successfully removed
[root@o ~]# lvcreate --nosync -R64k -m1 -nr -y -L1G usb_vg
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
  Logical volume "r" created.
[root@o ~]# dd of=/dev/usb_vg/r bs=256M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.62996 s, 165 MB/s
[root@o ~]# dd of=/dev/usb_vg/r bs=256M count=1 iflag=fullblock oflag=direct if=/dev/zero^C
[root@o ~]# lvremove -y usb_vg
  Logical volume "r" successfully removed
[root@o ~]# lvcreate --nosync -R8k -m1 -nr -y -L1G usb_vg
  WARNING: New raid1 won't be synchronised. Don't read what you didn't write!
  Logical volume "r" created.
[root@o ~]# dd of=/dev/usb_vg/r bs=256M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.67379 s, 160 MB/s
[root@o ~]# dd of=/dev/usb_vg/r bs=256M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.65624 s, 162 MB/s

Comment 5 Heinz Mauelshagen 2016-11-17 13:54:56 UTC
(In reply to Heinz Mauelshagen from comment #4)

The user's commands are in the initial mail
"[linux-lvm] very slow sequential writes on lvm raid1 (bitmap?)"
dated 11/7/2016.

I'm not seeing significant throughput variations for sequential writes with MD
(without and with a bitmap), nor, for the latter, when varying the bitmap chunk
size (aka the LVM regionsize):

[root@o ~]# mdadm -C /dev/md/r --bitmap=none -l 1 -n 2 /dev/sd[bh]
mdadm: /dev/sdb appears to be part of a raid array:
       level=raid1 devices=2 ctime=Thu Nov 17 14:49:47 2016
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdh appears to be part of a raid array:
       level=raid1 devices=2 ctime=Thu Nov 17 14:49:47 2016
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/r started.
[root@o ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md127 : active raid1 sdh[1] sdb[0]
      249928000 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.0% (170112/249928000) finish=24.4min speed=170112K/sec

unused devices: <none>
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.851073 s, 158 MB/s
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.825584 s, 163 MB/s
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.830841 s, 162 MB/s
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.856382 s, 157 MB/s
[root@o ~]# mdadm -S /dev/md/r
mdadm: stopped /dev/md/r
[root@o ~]# mdadm -C /dev/md/r --bitmap-chunk=524288 --bitmap=internal -l 1 -n 2 /dev/sd[bh]
mdadm: /dev/sdb appears to be part of a raid array:
       level=raid1 devices=2 ctime=Thu Nov 17 14:50:04 2016
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdh appears to be part of a raid array:
       level=raid1 devices=2 ctime=Thu Nov 17 14:50:04 2016
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/r started.
[root@o ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md127 : active raid1 sdh[1] sdb[0]
      249928000 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.0% (184192/249928000) finish=22.5min speed=184192K/sec
      bitmap: 1/1 pages [4KB], 524288KB chunk

unused devices: <none>
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.882936 s, 152 MB/s
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.855478 s, 157 MB/s
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.877999 s, 153 MB/s
[root@o ~]# mdadm -S /dev/md/r
mdadm: stopped /dev/md/r
[root@o ~]# mdadm -C /dev/md/r --bitmap-chunk=512 --bitmap=internal -l 1 -n 2 /dev/sd[bh]
mdadm: /dev/sdb appears to be part of a raid array:
       level=raid1 devices=2 ctime=Thu Nov 17 14:50:54 2016
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdh appears to be part of a raid array:
       level=raid1 devices=2 ctime=Thu Nov 17 14:50:54 2016
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/r started.
[root@o ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md127 : active raid1 sdh[1] sdb[0]
      249928000 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.1% (265280/249928000) finish=15.6min speed=265280K/sec
      bitmap: 239/239 pages [956KB], 512KB chunk

unused devices: <none>
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.87625 s, 153 MB/s
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.839441 s, 160 MB/s
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.830878 s, 162 MB/s
[root@o ~]# dd of=/dev/md/r bs=128M count=1 iflag=fullblock oflag=direct if=/dev/zero
1+0 records in
1+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.825255 s, 163 MB/s

Comment 6 Zdenek Kabelac 2016-11-18 10:08:38 UTC
In my test, when I use a 200 ms write delay on the secondary leg (simulating a disk with very large seek latency), there is an observable speed-up.

So it likely depends on how quickly you can sync the _tmeta writes on your attached USB storage.

If the USB storage is quite slow at seeks but still very fast at streaming writes, increasing the region size and thereby reducing the frequency of bitmap updates leads to a significant speedup.

I.e. in my case, a very small test case on slower hardware (T61) using a 200 ms dm-delay device:

512K regionsize and 64M write  -> ~67MB/s

32M regionsize and 64M write  -> ~128MB/s
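
For reference, a rough sketch of how such a delayed secondary leg can be built with dm-delay (device name and sizes are illustrative, not the exact setup used here):

# SLOW=/dev/sdX    # hypothetical backing device for the delayed leg
# # dm-delay table: <start> <len> delay <rdev> <roff> <rdelay_ms> [<wdev> <woff> <wdelay_ms>]
# dmsetup create delayed_leg --table "0 $(blockdev --getsz $SLOW) delay $SLOW 0 0 $SLOW 0 200"
# pvcreate /dev/mapper/delayed_leg   # then use it as one PV of the raid1 VG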


So I guess we need to know the disk types in use in the user's case.
Is there some USB storage available these days that gives large streaming bandwidth but poor seek access on small sector writes?

IMHO I'd guess some flash MicroSD cards might show this bad performance behavior. We probably need to query the user for this.

Comment 7 Leo Bergolth 2016-11-26 23:30:11 UTC
I think the reason is the random IO caused by the bitmap updates combined with the poor seek times of my 5000 RPM hard disks.

I did my tests with bs=1M oflag=direct while Heinz used bs=1G oflag=direct. Using bs=1G results in far fewer bitmap updates (more than 1000 with bs=1M vs. 60 with bs=1G for 1 GiB of data).

I'd expect each of those bitmap updates to cause two seeks. This random IO is, of course, very expensive, especially when slow 5000 RPM disks are used...
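
A rough back-of-the-envelope count (simplified, since the kernel batches the bitmap flushes, but it matches the order of magnitude):

  1 GiB / 512 KiB regionsize = 2048 regions
  - 1000 x 1 MiB direct writes: each newly touched region has to be marked dirty
    before the data write to it is issued, so up to ~2000 synchronous bitmap
    updates (>1000 measured above)
  - 1 x 1 GiB write: the regions are dirtied in large batches, so only a few
    dozen bitmap updates are needed (~60 measured above)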

I've recorded some tests with blktrace. The results can be downloaded from http://leo.kloburg.at/tmp/lvm-raid1-bitmap/


# lsusb -t
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
    |__ Port 4: Dev 3, If 0, Class=Hub, Driver=hub/4p, 5000M
        |__ Port 1: Dev 9, If 0, Class=Mass Storage, Driver=uas, 5000M
        |__ Port 2: Dev 8, If 0, Class=Mass Storage, Driver=uas, 5000M

# readlink -f /sys/class/block/sd[bc]/device/
/sys/devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4.2/2-4.2:1.0/host2/target2:0:0/2:0:0:0
/sys/devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4.1/2-4.1:1.0/host3/target3:0:0/3:0:0:0

# echo noop > /sys/block/sdb/queue/scheduler
# echo noop > /sys/block/sdc/queue/scheduler
# pvcreate /dev/sdb3 
# pvcreate /dev/sdc3 
# vgcreate vg_t /dev/sd[bc]3

# lvcreate --type raid1 -m 1 -L30G --regionsize=512k --nosync -y -n lv_t vg_t


# ---------- regionsize 512k, dd bs=1M oflags=direct
# blktrace -d /dev/sdb3 -d /dev/sdc3 -d /dev/vg_t/lv_t -D raid1-512k-reg-direct-bs-1M/
# dd if=/dev/zero of=/dev/vg_t/lv_t bs=1M count=1000 oflag=direct
1048576000 bytes (1,0 GB) copied, 55,7425 s, 18,8 MB/s

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb3              0,00     0,00    0,00   54,00     0,00 18504,00   685,33     0,14    2,52    0,00    2,52   1,70   9,20
sdc3              0,00     0,00    0,00   54,00     0,00 18504,00   685,33     0,14    2,52    0,00    2,52   1,67   9,00
dm-9              0,00     0,00    0,00   18,00     0,00 18432,00  2048,00     1,00   54,06    0,00   54,06  55,39  99,70

# ---------- regionsize 512k, dd bs=1G oflags=direct (like Heinz Mauelshagens test)
# blktrace -d /dev/sdb3 -d /dev/sdc3 -d /dev/vg_t/lv_t -D raid1-512k-reg-direct-bs-1G/
# dd if=/dev/zero of=/dev/vg_t/lv_t bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1,1 GB) copied, 7,3139 s, 147 MB/s

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb3              0,00     0,00    0,00  306,00     0,00 156672,00  1024,00   135,47  441,34    0,00  441,34   3,27 100,00
sdc3              0,00     0,00    0,00  302,00     0,00 154624,00  1024,00   129,46  421,76    0,00  421,76   3,31 100,00
dm-9              0,00     0,00    0,00    0,00     0,00     0,00     0,00   648,81    0,00    0,00    0,00   0,00 100,00


# ---------- regionsize 512k, dd bs=1M conv=fsync
# blktrace -d /dev/sdb3 -d /dev/sdc3 -d /dev/vg_t/lv_t -D raid1-512k-reg-fsync-bs-1M/
# dd if=/dev/zero of=/dev/vg_t/lv_t bs=1M count=1000 conv=fsync
1048576000 bytes (1,0 GB) copied, 7,75605 s, 135 MB/s

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb3              0,00 21971,00    0,00  285,00     0,00 145920,00  1024,00   141,99  540,75    0,00  540,75   3,51 100,00
sdc3              0,00 21971,00    0,00  310,00     0,00 158720,00  1024,00   106,86  429,35    0,00  429,35   3,23 100,00
dm-9              0,00     0,00    0,00    0,00     0,00     0,00     0,00 24561,60    0,00    0,00    0,00   0,00 100,00

Comment 8 Heinz Mauelshagen 2017-02-28 17:28:46 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1392947 for the upstream
enhancement to change the region size on existing RaidLVs.
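
Once that lands, changing the region size of an existing RaidLV should look roughly like this (a sketch assuming lvconvert grows the -R/--regionsize option described there; names and the 2 MiB value are examples):

# lvconvert -R 2M vg_t/lv_t
# lvs -o+regionsize vg_t/lv_t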

Comment 9 Heinz Mauelshagen 2017-04-13 20:47:24 UTC
Upstream commit 5ae7a016b8e5796d36cf491345b1cf8e43ec9ea5

