Description of problem:

Running pg_test_fsync (the PostgreSQL utility to test filesystem fsync speed) I get unexpectedly slow results on LVM raid1. Specifically, I get between 1.5 and 3 fsync/s on a 1 TB LVM raid1 (-m1 on lvcreate, created one or two years ago) with ext4, activated and mounted at boot. Testing on a just-created small raid1 (also -m1) in a separate VG, I get ~105 fsync/s. If I reboot the system and test again, I get ~15 fsync/s on the very same filesystem. The same system has some mdraid1 devices, and they always work at >100 fsync/s.

--------------------

Version-Release number of selected component (if applicable):
Linux 4.4.6
Linux 3.18.29
lvm2 2.02.140

How reproducible:
Always.

Steps to Reproduce:
1. Create vg/lv (lvcreate -n testlv -m 1 -l 120 test)
2. mkfs.ext4 /dev/test/testlv
3. mount /dev/test/testlv /mnt/test
4. pg_test_fsync -f /mnt/test/testfile

Expected results (steps 1-4 on Linux 3.18):

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     107.786 ops/sec    9278 usecs/op
        fdatasync                         114.462 ops/sec    8737 usecs/op
        fsync                             104.928 ops/sec    9530 usecs/op
        fsync_writethrough                            n/a
        open_sync                         105.290 ops/sec    9498 usecs/op

Actual results (only steps 3-4 on Linux 3.18 after reboot; same for Linux 4.4.6):

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      13.828 ops/sec   72316 usecs/op
        fdatasync                          14.512 ops/sec   68907 usecs/op
        fsync                              12.502 ops/sec   79984 usecs/op
        fsync_writethrough                            n/a
        open_sync                          12.814 ops/sec   78040 usecs/op

My biggest problem is not with the testlv, but with another ext4 on a 1 TB LVM raid1, which reports the following for the same test. It is on faster hardware, but I really cannot destroy this array to recreate it for testing:

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                       9.424 ops/sec  106110 usecs/op
        fdatasync                          10.850 ops/sec   92167 usecs/op
        fsync                               3.347 ops/sec  298755 usecs/op
        fsync_writethrough                            n/a
        open_sync                           3.290 ops/sec  303942 usecs/op

Extra data:

# lvdisplay
  --- Logical volume ---
  LV Path                /dev/teraraid/teraraid_main
  LV Name                teraraid_main
  VG Name                teraraid
  LV UUID                lNtPhi-Ahrg-0utX-Gkhf-EcFy-asgM-EpLt0w
  LV Write Access        read/write
  LV Creation host, time atom, 2013-07-25 10:35:13 +0200
  LV Status              available
  # open                 1
  LV Size                931,50 GiB
  Current LE             238464
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:4

  --- Logical volume ---
  LV Path                /dev/test/testlv
  LV Name                testlv
  VG Name                test
  LV UUID                zCInyQ-WQGy-jGmd-TI3X-sTyO-9VOc-sUj81J
  LV Write Access        read/write
  LV Creation host, time atom, 2016-05-02 16:00:50 +0200
  LV Status              available
  # open                 1
  LV Size                480,00 MiB
  Current LE             120
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:9

# lvs
  LV            VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  teraraid_main teraraid Rwi-aor--- 931,50g                                   100,00
  testlv        test     rwi-aor--- 480,00m                                   100,00

[root@atom:/mnt]# dmsetup table
test-testlv: 0 983040 raid raid1 3 0 region_size 1024 2 254:5 254:6 254:7 254:8
teraraid-teraraid_main: 0 1953497088 raid raid1 3 0 region_size 1024 2 254:0 254:1 254:2 254:3
test-testlv_rimage_1: 0 983040 linear 8:22 10240
teraraid-teraraid_main_rimage_1: 0 1953497088 linear 8:33 10240
test-testlv_rimage_0: 0 983040 linear 8:2 10240
teraraid-teraraid_main_rimage_0: 0 1953497088 linear 8:49 10240
test-testlv_rmeta_1: 0 8192 linear 8:22 2048
teraraid-teraraid_main_rmeta_1: 0 8192 linear 8:33 2048
test-testlv_rmeta_0: 0 8192 linear 8:2 2048
teraraid-teraraid_main_rmeta_0: 0 8192 linear 8:49 2048

[root@atom:/mnt]# dmsetup status
test-testlv: 0 983040 raid raid1 2 AA 983040/983040 idle 0
teraraid-teraraid_main: 0 1953497088 raid raid1 2 AA 1953497088/1953497088 idle 0
test-testlv_rimage_1: 0 983040 linear
teraraid-teraraid_main_rimage_1: 0 1953497088 linear
test-testlv_rimage_0: 0 983040 linear
teraraid-teraraid_main_rimage_0: 0 1953497088 linear
test-testlv_rmeta_1: 0 8192 linear
teraraid-teraraid_main_rmeta_1: 0 8192 linear
test-testlv_rmeta_0: 0 8192 linear
teraraid-teraraid_main_rmeta_0: 0 8192 linear

# cat /proc/partitions
major minor  #blocks  name

   8        0  244198584 sda
   8        1   52428800 sda1
   8        2     524288 sda2
   8        3  104857600 sda3
   8        4   86386872 sda4
   8       16  312571224 sdb
   8       17   52428800 sdb1
   8       18          1 sdb2
   8       19  104857600 sdb3
   8       20   86386872 sdb4
   8       21     524288 sdb5
   8       22     524288 sdb6
   8       32  976762584 sdc
   8       33  976761543 sdc1
   8       48  976762584 sdd
   8       49  976761543 sdd1
   8       64  312571224 sde
   8       65  312570200 sde1
   9      127  312569040 md127
   9      126   86385776 md126
   9      125  104856504 md125
   9      124   52428736 md124
 254        0       4096 dm-0
 254        1  976748544 dm-1
 254        2       4096 dm-2
 254        3  976748544 dm-3
 254        4  976748544 dm-4
 254        5       4096 dm-5
 254        6     491520 dm-6
 254        7       4096 dm-7
 254        8     491520 dm-8
 254        9     491520 dm-9
In the original post I missed reporting that on 4.4.6 everything matches 3.18: >100 fsync/s if just created/formatted/mounted, and ~12 fsync/s if activated and mounted at boot. At some point "lvcreate -n testlv -m 1 -l 120 test" didn't work for me on 4.4.6, and after booting 3.18 it worked. But maybe it was some broken dm state from that moment, when I had been testing some mdraids. That lvcreate failure would be another report, and I can't reproduce it anymore.
More information: all sd devices report a speed of ~90 MB/s with "hdparm -t", and so do the RAID devices (md and LVM raid1). I have no complaints about bulk read speed... but the fsync/s differ enormously.
I also notice a degradation of performance on mdraid1 devices, depending on whether they are assembled at boot (~30 fsync/s) or just created (105 fsync/s). On Linux 4.4.6.
To give an idea of the fsync test of pg_test_fsync, here is an strace -tts:

1462206137.054947 fsync(3)              = 0
1462206137.333041 lseek(3, 0, SEEK_SET) = 0
1462206137.333223 write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192
1462206137.333560 fsync(3)              = 0
1462206137.612692 lseek(3, 0, SEEK_SET) = 0
1462206137.613141 write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192
1462206137.613552 fsync(3)              = 0
1462206137.904430 lseek(3, 0, SEEK_SET) = 0
1462206137.904980 write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192
1462206137.905273 fsync(3)              = 0
1462206138.183157 lseek(3, 0, SEEK_SET) = 0

It shows around 3 fsync/s (that's the 1 TB teraraid ext4 created in 2013), the slowest in the system.
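As a rough illustration, here is a minimal shell sketch of what that loop does (one 8 kB write followed by fsync, repeated). This is my approximation, not pg_test_fsync itself: pg_test_fsync keeps the same fd open and seeks back, while this reopens the file each iteration. Paths are examples; GNU coreutils assumed:

n=30
t0=$(date +%s.%N)
for i in $(seq $n); do
    # one 8 kB write, then fsync of the file data (conv=fsync)
    dd if=/dev/zero of=/mnt/test/testfile bs=8k count=1 \
       conv=fsync,notrunc status=none
done
t1=$(date +%s.%N)
echo "approx fsync ops/sec: $(echo "$n / ($t1 - $t0)" | bc -l)"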
I just tested the I/O schedulers for the two teraraid disks (sdc/sdd):

noop = much slower: 1.2 fsync/s
deadline = ~3.5 fsync/s
cfq (the default) = ~3.5 fsync/s
Created attachment 1153010 [details]
blktrace of slowest raid1 + 2 pvs

When pg_test_fsync started the fsync() test, I ran blktrace on the three devices (the raid + the 2 PVs). I ctrl-c'ed blktrace before the fsync test finished; it might have run for around 3 seconds.
Created attachment 1153036 [details]
blktrace of fsync test, just created/formatted/mounted

I just created the testlv, and the performance is OK: 96 fsync/s, about 100. That's the performance I would like to see always.
Created attachment 1153037 [details]
blktrace of fsync test, just rebooted and mounted again

I umounted the ext4 from the previous attachment and rebooted the system. After boot, I checked that there was no I/O activity on the drives, mounted it again, ran the test, and got 11 fsync/s: a 10x drop in performance. The trace, like before, covers three block devices: the raid and the two PVs.
kabi helped on IRC getting more information. After reboot (~15 fsync/s), if I enable writemostly and then disable it on every PV of the raid1, plus an "lvchange --refresh", I get back to 100 fsync/s. I tried the same on the old 1 TB raid1, and it rose from 3.5 fsync/s to 17 fsync/s. Still very low, but a big improvement. kabi suspects "some setting incorrectly set in the dm_raid wrapper which gets 'magically' fixed when manipulating writemostly" (all of this tested on Linux 4.4.6).
I have to note that if I enabled writemostly on the mirror (not the origin), the benchmark didn't improve until I ran "lvchange --refresh" on the LV. Disabling writemostly + refresh made it drop back down to slow fsync/s. If I enabled writemostly on the origin (not the mirror), "lvchange --refresh" was not needed and the fsync/s were already high. And disabling that writemostly later did not make the fsync/s drop back down.
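For reference, a minimal sketch of the toggle sequence described above, assuming the standard "lvchange --writemostly PV[:y|n] VG/LV" syntax (the PV and LV names are the ones from this report; adjust to your setup):

# mark one leg writemostly and refresh...
lvchange --writemostly /dev/sdc1:y teraraid/teraraid_main
lvchange --refresh teraraid/teraraid_main
# ...then clear the flag and refresh again
lvchange --writemostly /dev/sdc1:n teraraid/teraraid_main
lvchange --refresh teraraid/teraraid_main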
To make sure this _is_ raid1 related: do you see the same drop on a linear LV?
Does "lvchange --refresh ..." on the raid1 LV even help w/o writemostly?
No, lvchange --refresh does not help after boot, without writemostly. I will test the linear LV just-created and after-reboot.
I tested your raid1+ext4+pg_test_fsync on a Fedora 23 VM (kernel 4.4.7-300.fc23.x86_64, postgresql-contrib 9.4.7-1.fc23) on both SSD and iSCSI based host backing store and did not see such variance. Let's find out what is going on with your disks:

- anything related in the kernel log?
- in case they support SMART: anything "smartctl -a .." unveils on them?
I didn't think that it could be the disks, because I can:

1) boot the system, and test that it runs slow in the testlv raid with ext4
2) destroy it, create it anew, format ext4
3) test the speed, and it runs fast.

But if I boot the system again, and run the test, it runs slow again.

I really could not see anything special in the dmesg. I will check smart, but all the disks work at 90MB/s without trouble with "hdparm -t".
(In reply to viric from comment #15)
> I didn't think that it could be the disks, because I can:
>
> 1) boot the system, and test that it runs slow in the testlv raid with ext4
> 2) destroy it, create it anew, format ext4
> 3) test the speed, and it runs fast.
>
> But if I boot the system again, and run the test, it runs slow again.
>
> I really could not see anything special in the dmesg. I will check smart,
> but all the disks work at 90MB/s without trouble with "hdparm -t".

We need to walk up the stack and make sure that all components underneath pg_test_fsync are sane.
Just to rule out userland, I can supply an strace -T:

...
fsync(3)                = 0 <0.282584>
lseek(3, 0, SEEK_SET)   = 0 <0.000080>
write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192 <0.000127>
fsync(3)                = 0 <0.599434>
lseek(3, 0, SEEK_SET)   = 0 <0.000045>
write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192 <0.000134>
fsync(3)                = 0 <0.274169>
lseek(3, 0, SEEK_SET)   = 0 <0.000199>
write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192 <0.000281>
fsync(3)                = 0 <0.273742>
...

So, the time is definitely spent in fsync(). (This is on ext4 over teraraid_main.)
Here are the smart attributes of sdc and sdd (those of the slowest raid), 1TB each. Health: PASSED and no error log in both.

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   099   006    Pre-fail  Always       -       56640040
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       13653903
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       24284
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       15
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   061   045    Old_age   Always       -       31 (Min/Max 28/32)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   087   087   000    Old_age   Always       -       26872
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1412h+10m+07.992s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       94230993838
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       277241169063

[root@atom:~]# smartctl -A /dev/sdc
smartctl 6.4 2015-11-07 r4167 [x86_64-linux-4.4.6] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   099   006    Pre-fail  Always       -       46101008
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       13179346
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       24363
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       15
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   063   045    Old_age   Always       -       28 (Min/Max 26/28)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1149958
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10778h+17m+46.319s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       116734019346
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       164927456422
Got to analyzing your blktraces from comments 6-8; neither of them shows concerning delays. The 95 fsync/s blktrace even shows more disk delays over 20ms than your 11 fsync/s one:

--- blktrace-lvmraid1-sda2-sdb6-11fync_per_s-afterboot.txt ---
Line 2877 shows delay=0.087836 -> 8,22 0 33 0.279400236 0 C WS 489453568 [0]
Line 24985 shows delay=0.020364 -> 8,22 0 282 1.886307724 0 C WS 0 [0]
Processed 28493 lines

--- blktrace-lvmraid1-sda2-sdb6-95fync_per_s.txt ---
Line 27761 shows delay=0.025947 -> 8,2 2 198 0.890765808 182 P N [md126_raid1]
Line 28749 shows delay=0.046491 -> 8,2 3 271 0.949888466 12436 C WS 0 [0]
Line 29189 shows delay=0.044012 -> 8,22 1 448 1.003416511 0 C WS 0 [0]
Line 34453 shows delay=0.025944 -> 8,22 0 605 1.257287095 12434 C WS 0 [0]
Line 49493 shows delay=0.022982 -> 8,22 2 441 1.771404906 12436 C WS 0 [0]
Line 68477 shows delay=0.022528 -> 8,22 2 737 2.381291379 0 C WS 0 [0]
Line 71669 shows delay=0.037130 -> 8,22 3 1378 2.530514890 12666 C WS 0 [0]
Line 72273 shows delay=0.026166 -> 8,22 2 766 2.578556228 0 C WS 0 [0]
Line 76149 shows delay=0.025658 -> 8,22 0 825 2.731376221 12441 C WS 0 [0]
Processed 80777 lines

--- blktrace-raid-sdc-sdd-fsync.txt ---
Line 10921 shows delay=0.289064 -> 8,32 3 162 1.539932025 0 C WS 0 [0]
No major/minor found
Processed 19237 lines

You mentioned that you did not let blktrace run to the end, so the cause may not be in the trace.
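In case you want to scan for such gaps yourself, here is a minimal sketch of how a delay listing like the above can be produced from a blkparse text dump (assumes the default blkparse output with the timestamp in the 4th column; the trace basename is an example):

blkparse -i sda2 | awk '
    $4 ~ /^[0-9]/ {
        if (prev != "" && $4 - prev > 0.020)
            printf "delay=%.6f -> %s\n", $4 - prev, $0
        prev = $4
    }'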
Making testlv (sda2+sdb6) a linear LV and testing the fsyncs there, I get 116 fsync/s after creation+format, and the same after boot+mount (117 fsync/s).

Sorry, about blktrace I don't understand... can I supply something more useful? I only ran blktrace during the fsync phase of the test, to keep the file smaller.
(In reply to viric from comment #20)
> Making testlv (sda2+sdb6) a linear LV and testing the fsyncs there, I get
> 116 fsync/s after creation+format, and the same after boot+mount (117 fsync/s).

So you don't see the drop after reboot on a linear LV?

> Sorry, about blktrace I don't understand... can I supply something more
> useful? I only ran blktrace during the fsync phase of the test, to keep the
> file smaller.

It doesn't matter that the traces don't cover the complete fsync phase. As mentioned in my comment #19, blktrace does not unveil a reason in the block layer for the factor-10 drop you report.
(In reply to Heinz Mauelshagen from comment #21)
> (In reply to viric from comment #20)
> So you don't see the drop after reboot on a linear LV?

Exactly. No drop.

As for blktrace, I understand that you don't need more traces from me.
Notice that in comment #3 I report the same phenomenon on mdraid: 100 fsync/s (just created+formatted+mounted) dropping to 30 fsync/s (reboot+mount).
What are the results if you stripe across your 2 PVs (lvcreate -L1T -I4k ...)? No drops after reboot either?
In the test volume, I only have 254 extents.

# lvcreate -n testlv -I4k -i 2 -l 240 test
  Logical volume "testlv" already exists in volume group "test"
# mkfs.ext4 /dev/test/testlv
...
# mount /dev/test/testlv raidtest/
# pg_test_fsync -f raidtest/prova
...
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     118.950 ops/sec    8407 usecs/op
        fdatasync                         115.677 ops/sec    8645 usecs/op
        fsync                             111.675 ops/sec    8955 usecs/op

----------
Then I reboot:

# mount /dev/test/testlv raidtest/
# pg_test_fsync -f raidtest/prova
        open_datasync                     116.382 ops/sec    8592 usecs/op
        fdatasync                         119.655 ops/sec    8357 usecs/op
        fsync                              43.608 ops/sec   22932 usecs/op

(I ran the commands above more than once, and the numbers are reliably those. I also checked with iostat that there was no other I/O activity.)
(In reply to viric from comment #25)
> In the test volume, I only have 254 extents.
>
> # lvcreate -n testlv -I4k -i 2 -l 240 test
>   Logical volume "testlv" already exists in volume group "test"
  ^^^ failed to create testlv as striped, please remove and repeat.
Oh sorry, I just noticed that lvcreate failed. The testlv there was the result of:

# lvcreate -n testlv -I4k -l 240 test
  Ignoring stripesize argument with single stripe.
WARNING: ext4 signature detected on /dev/test/testlv at offset 1080. Wipe it? [y/n]: ^C
  Interrupted...
  Aborted wiping of ext4.
  1 existing signature left on the device.
  Logical volume "testlv" created.
[root@atom:/mnt]# lvcreate -n testlv -I4k -i 2 -l 240 test
  Logical volume "testlv" already exists in volume group "test"

(So, without "-i 2", which I don't know if it was required.) Let me test again:

-----------------------
[root@atom:/mnt]# lvcreate -n testlv -I4k -i 2 -l 240 test
WARNING: ext4 signature detected on /dev/test/testlv at offset 1080. Wipe it? [y/n]: y
  Wiping ext4 signature on /dev/test/testlv.
  Logical volume "testlv" created.
[root@atom:/mnt]# mkfs.ext4 /dev/test/testlv
...
[root@atom:/mnt]# mount /dev/test/testlv raidtest/
[root@atom:/mnt]# pg_test_fsync -f raidtest/prova

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     112.690 ops/sec    8874 usecs/op
        fdatasync                         118.654 ops/sec    8428 usecs/op
        fsync                              29.753 ops/sec   33610 usecs/op
        fsync_writethrough                            n/a
        open_sync                          27.871 ops/sec   35880 usecs/op

So, in this case, it has slow fsyncs even without a reboot (checked more than once; reliable results).

--------------
After reboot:

[root@atom:/mnt]# mount /dev/test/testlv raidtest/
[root@atom:/mnt]# pg_test_fsync -f raidtest/prova

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     119.843 ops/sec    8344 usecs/op
        fdatasync                         119.526 ops/sec    8366 usecs/op
        fsync                              30.489 ops/sec   32798 usecs/op
        fsync_writethrough                            n/a
        open_sync                          29.908 ops/sec   33436 usecs/op

(Reliable results; test run more than once.)
I think we've ruled out any HW bottlenecks with those results. As mentioned, I'm not able to reproduce your results with the 4.4.7 Fedora 23 kernel here, which performs well after reboot in the pg_test_fsync runs. Try whether a newer kernel works for you.
I have just installed 4.4.8. I had created the raid1 testlv before booting into it.

Linux atom 4.4.8 #1-NixOS SMP Wed Apr 20 06:44:02 UTC 2016 x86_64 GNU/Linux

After reboot, on the raid1 testlv:

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      14.915 ops/sec   67045 usecs/op
        fdatasync                          14.534 ops/sec   68802 usecs/op
        fsync                              12.385 ops/sec   80743 usecs/op
        fsync_writethrough                            n/a
        open_sync                          10.391 ops/sec   96241 usecs/op

If I destroy it and create it again (as in comment #1), I'm back at 100 fsync/s. Exactly the same as 4.4.6.
Remember that I could reproduce it in 3.18 too.
Yes, likely something specific to your configuration unless we can come up with a reproducer here.
I have no idea what to look at. If you have ideas, I can provide the data. I will try to reproduce it on a new system.
I have created an LVM raid1 on a new system (PVs sda2 and sdb2), about 500 MB big, running Linux 4.4.7 x86_64. I see, just after creation, 40 fsync/s. If I make it linear, 110 fsync/s.

I notice that I can read it with dd at 100 MB/s. But writing (ctrl-c after some seconds):

# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8K
dd: error writing '/dev/test/testlv': No space left on device
61441+0 records in
61440+0 records out
503316480 bytes (503 MB, 480 MiB) copied, 45.43 s, 11.1 MB/s

I didn't try rebooting, because it runs at awful speed even just after creation on this system.
No new ideas here so far. With an 8K block size, writing will be inefficient anyway; use a larger block size.
OK. I create the LVM raid1 and run dd oflag=direct bs=5M: 44 MB/s. I reboot and run the same dd: 37.6 MB/s. Both results seem close, but the dd definitely reports exactly 33.6 MB/s, in 3 tests in a row.

If I mkfs.ext4 on that testlv and mount it (I created the LV in the previous boot, not this one), I get 15 fsync/s as before. So it's not ext4.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      14.150 ops/sec   70671 usecs/op
        fdatasync                          14.922 ops/sec   67014 usecs/op
        fsync                              12.943 ops/sec   77262 usecs/op
        fsync_writethrough                            n/a
        open_sync                          11.788 ops/sec   84833 usecs/op
Just tested the conventional LVM raid1 on that 4.4.7 system of comment #33.

Speed just created+format+mount: 40 fsync/s (as reported in comment #33).
Reboot, mount and test: 3.9 fsync/s.

So I can reproduce it on a 2nd system with completely different hardware (all spinning disks, though).
Continuing from the previous comment: now that I have the raid running slow (the after-boot situation), I run the dd:

# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=1000
1000+0 records in
1000+0 records out
8192000 bytes (8.2 MB, 7.8 MiB) copied, 96.0817 s, 85.3 kB/s

Something is pretty wrong. No other I/O load on any disk of the computer; I checked. Any run of the dd gives 80-90 kB/s. The same dd on a just-created raid1 (without a reboot in between) gave 11 MB/s.
Created attachment 1153591 [details]
blktrace in system2 of a dd direct bs=8k that runs at 89kB/s on the lvm raid1

As I was in the after-boot situation of the very slow LVM raid1, I ran the dd with blktrace started before it and stopped after it. I hope it helps.

[root@comanegra:~]# blktrace -D /dev/shm/ /dev/sda2 /dev/sdb2 /dev/test/testlv &
[1] 3989
[root@comanegra:~]# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 9.30974 s, 88.0 kB/s
[root@comanegra:~]# fg
blktrace -D /dev/shm/ /dev/sda2 /dev/sdb2 /dev/test/testlv
^C=== sda2 ===
 CPU  0:    910 events,   43 KiB data
 CPU  1:   2151 events,  101 KiB data
 CPU  2:   1191 events,   56 KiB data
 CPU  3:   5838 events,  274 KiB data
 Total:   10090 events (dropped 0),  474 KiB data
=== sdb2 ===
 CPU  0:    703 events,   33 KiB data
 CPU  1:   2151 events,  101 KiB data
 CPU  2:   1185 events,   56 KiB data
 CPU  3:   5362 events,  252 KiB data
 Total:    9401 events (dropped 0),  441 KiB data
=== dm-9 ===
 CPU  0:     31 events,    2 KiB data
 CPU  1:     29 events,    2 KiB data
 CPU  2:     73 events,    4 KiB data
 CPU  3:    168 events,    8 KiB data
 Total:     301 events (dropped 0),   15 KiB data
Just to add that it only depends on recreating the raid1... starting from the previous slow situation, see the following commands:

[root@comanegra:~]# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 9.30974 s, 88.0 kB/s

[root@comanegra:~]# systemctl suspend

***** I WAKE IT FROM SUSPEND *****

[root@comanegra:~]# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 9.35214 s, 87.6 kB/s

[root@comanegra:~]# lvremove test/testlv
Do you really want to remove active logical volume testlv? [y/n]: y
  Logical volume "testlv" successfully removed
[root@comanegra:~]# lvcreate -n testlv -l 120 -m1 test
  allocation/use_blkid_wiping=1 configuration setting is set while LVM is not compiled with blkid wiping support.
  Falling back to native LVM signature detection.
  Logical volume "testlv" created.
[root@comanegra:~]# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 0.0721038 s, 11.4 MB/s
Not able to reproduce your slow dd results here:

# perf record dd if=/dev/zero of=/dev/tst/r oflag=direct ibs=8K iflag=fullblock obs=8K count=100
100+0 records in
100+0 records out
819200 bytes (819 kB) copied, 0.0584847 s, 14.0 MB/s

Report:
    14.29%  dd  [kernel.vmlinux]  [k] _clear_user
    14.29%  dd  [kernel.vmlinux]  [k] get_user_pages_fast
     7.14%  dd  [kernel.vmlinux]  [k] __kmalloc
     7.14%  dd  [kernel.vmlinux]  [k] bio_add_page
     7.14%  dd  [kernel.vmlinux]  [k] entry_SYSCALL_64_fastpath
     7.14%  dd  [kernel.vmlinux]  [k] finish_task_switch
     7.14%  dd  [kernel.vmlinux]  [k] get_page_from_freelist
     7.14%  dd  [md_mod]          [k] bitmap_get_counter
     7.14%  dd  [raid1]           [k] raid1_make_request
     7.14%  dd  libc-2.22.so      [.] __GI___libc_write
     7.14%  dd  libc-2.22.so      [.] __fprintf_chk
     7.14%  dd  libc-2.22.so      [.] _dl_addr
# perf record dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 9.59411 s, 85.4 kB/s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (78 samples) ]

# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ................................
    24.66%  dd  libc-2.23.so       [.] __strcmp_sse2
     5.51%  dd  [kernel.kallsyms]  [k] dequeue_task_fair
     4.58%  dd  [kernel.kallsyms]  [k] __clear_user
     3.24%  dd  [kernel.kallsyms]  [k] find_busiest_group
     3.18%  dd  libc-2.23.so       [.] add_module.isra.1
     2.84%  dd  libc-2.23.so       [.] __tsearch
     2.83%  dd  [kernel.kallsyms]  [k] unlock_page
     2.76%  dd  [kernel.kallsyms]  [k] syscall_trace_enter_phase1
     2.71%  dd  [kernel.kallsyms]  [k] copy_page
     2.51%  dd  libc-2.23.so       [.] __GI___libc_write
     2.45%  dd  [kernel.kallsyms]  [k] ___slab_alloc
     2.45%  dd  [kernel.kallsyms]  [k] idle_cpu
     2.45%  dd  [md_mod]           [k] 0x000000000000653d
     2.24%  dd  [kernel.kallsyms]  [k] mempool_free
     2.07%  dd  [kernel.kallsyms]  [k] __msecs_to_jiffies
     2.07%  dd  [kernel.kallsyms]  [k] prepare_to_wait_event
     1.95%  dd  libc-2.23.so       [.] __memcmp_sse2
     1.70%  dd  libc-2.23.so       [.] __gconv_read_conf
     1.57%  dd  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     1.54%  dd  [kernel.kallsyms]  [k] __wake_up_common
     1.54%  dd  [kernel.kallsyms]  [k] bio_advance
     1.54%  dd  [kernel.kallsyms]  [k] dio_bio_complete
     1.54%  dd  [kernel.kallsyms]  [k] mempool_alloc
     1.54%  dd  [kernel.kallsyms]  [k] resched_curr
     1.52%  dd  [kernel.kallsyms]  [k] _raw_spin_lock
     1.52%  dd  [kernel.kallsyms]  [k] _raw_write_unlock_irqrestore
     1.52%  dd  [kernel.kallsyms]  [k] cmpxchg_double_slab.isra.61
     1.52%  dd  [kernel.kallsyms]  [k] schedule
     1.52%  dd  [md_mod]           [k] 0x0000000000010925
     1.52%  dd  ld-2.23.so         [.] check_match
     1.52%  dd  libc-2.23.so       [.] __memcpy_sse2
     1.52%  dd  libc-2.23.so       [.] memchr
     1.50%  dd  [kernel.kallsyms]  [k] __do_page_fault
     1.47%  dd  libc-2.23.so       [.] malloc
     1.38%  dd  [kernel.kallsyms]  [k] __mnt_is_readonly.part.8
     0.71%  dd  [kernel.kallsyms]  [k] __blockdev_direct_IO
     0.59%  dd  [kernel.kallsyms]  [k] page_fault
     0.57%  dd  [kernel.kallsyms]  [k] schedule_timeout
     0.06%  dd  [kernel.kallsyms]  [k] blk_check_plugged
     0.05%  dd  [kernel.kallsyms]  [k] clear_buddies
     0.05%  dd  [kernel.kallsyms]  [k] do_close_on_exec
     0.01%  dd  [kernel.kallsyms]  [k] native_write_msr_safe
     0.00%  perf [kernel.kallsyms] [k] native_write_msr_safe
Now the same with "perf record -a":

# Total Lost Samples: 0
#
# Samples: 11K of event 'cycles'
# Event count (approx.): 665454046
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ................................
    76.24%  swapper  [kernel.kallsyms]  [k] native_safe_halt
     0.48%  dd       libc-2.23.so       [.] __strcmp_sse2
     0.44%  swapper  [kernel.kallsyms]  [k] ahci_handle_port_interrupt
     0.33%  swapper  [kernel.kallsyms]  [k] __schedule
     0.29%  swapper  [kernel.kallsyms]  [k] update_wall_time
     0.29%  swapper  [kernel.kallsyms]  [k] read_tsc
     0.23%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock
     0.23%  swapper  [kernel.kallsyms]  [k] __switch_to
Created attachment 1153915 [details]
perf stat -B in before/after recreation
Created attachment 1153925 [details]
perf stat -a of dd before/after recreation, and of sleep 9

I add a perf stat -a (all CPUs), and also one of "sleep 9" (more or less the duration of the slow dd). The fast dd manages to do the job in 1613 context switches, "sleep 9" takes 2558 context switches (few other processes in the system), and the slow dd takes 12105 context switches. That's the biggest difference I see.
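For anyone reproducing that comparison, a minimal sketch (the device path is the one from this report):

# count context switches for a single dd run in isolation
perf stat -e context-switches,cpu-migrations \
    dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100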
I noticed this about immediate safe mode at boot. Maybe it is relevant?

# dmesg | grep -e 'md:\|logical'
[    1.662122] md: raid6 personality registered for level 6
[    1.662124] md: raid5 personality registered for level 5
[    1.662125] md: raid4 personality registered for level 4
[    1.695356] sd 2:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
[    1.695379] sd 1:0:0:0: [sda] 1465149168 512-byte logical blocks: (750 GB/699 GiB)
[    2.039841] md: raid1 personality registered for level 1
[    2.211842] stage-1-init: 1 logical volume(s) in volume group "raid2" now active
[    2.385094] stage-1-init: 1 logical volume(s) in volume group "test" now active
[    2.710032] md: mdX in immediate safe mode
[    2.710176] md: mdX in immediate safe mode
Some findings with kabi on IRC...

1) In the distro we have "vgchange -ay" at boot. It activated my rootfs (raid2) and test/testlv.
2) If I change that to "vgchange -ay raid2", testlv is not activated at boot, and if I activate it manually after boot, it runs fast. To be noted, "raid2" also runs fast.
3) In the distro we don't have a clean shutdown that deactivates the arrays. That may be the source of all this.

Nevertheless... what makes an array go back to normal mode? I tried triggering a check:

[  234.879970] md: data-check of RAID array mdX
[  234.879980] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[  234.879986] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[  234.879994] md: using 128k window, over a total of 491520k.
[  240.841787] md: mdX: data-check done.

Yet the array still performs slow after it. The only way I found to reliably make testlv run fast (normal mode, I guess) is to deactivate it before shutdown, and reboot with "vgchange -ay raid2" (so, not activating it), then activating it manually (sketched below). Maybe we are missing something in the initrd? But why does raid2 (the rootfs) work fine, despite showing the same "immediate safe mode" message?
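A sketch of that only-reliable sequence, assuming the VG names above:

vgchange -an test     # deactivate the test VG before shutdown
# reboot; the boot scripts run only "vgchange -ay raid2"
vgchange -ay test     # activate manually after boot -> fast fsync/s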
Isn't this "immediate safe mode", in fact, a mode set on the array at shutdown, so that you aren't required to deactivate it? That's what I read from this 2012 blog post: http://neil.brown.name/blog/20120615073245

If so, how do I set that mode at shutdown, so the array looks perfectly safe at boot, as if it had been deactivated at shutdown?
This immediate safe mode seems to hit every RAID in our system, not only LVM raid. On the first box I reported about, where mdraid and lvmraid live together, we get this at boot:

[    5.399844] md: md127 in immediate safe mode
[    5.400316] md: md126 in immediate safe mode
[    5.400439] md: md125 in immediate safe mode
[    5.400677] md: md123 in immediate safe mode
[    5.400828] md: md124 in immediate safe mode
[    5.401565] md: mdX in immediate safe mode

No idea what should bring those arrays out of safe mode, or what we are doing special in the distro that makes them go into that state at boot. We reproduce it on every system, even in VMs.
OK, we solved the "immediate safe mode" at boot, in the distro. We were running a "pkill -9 -v -f @" to ensure that we were not leaving any unwanted process in the initrd... and that was signaling kernel thread PIDs, and if you signal the md kernel thread, it puts the RAID into safe mode. That behavior was intended for the big kill in the common shutdown sequences of distros, I guess.
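A minimal sketch of a safer variant, assuming kernel threads can be recognized by their empty /proc/PID/cmdline (this is the concept; our actual initrd fix may differ in detail):

for pid in $(pgrep -v -f @); do
    # kernel threads have an empty /proc/PID/cmdline; never signal them
    [ -s /proc/$pid/cmdline ] && kill -9 $pid
done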
So has that made all the other problems go away?
This definitely cleared any difference between just-booted and just-created arrays, both mdraid and lvmraid.

Now I'm hitting another issue on the same box, about low fsync performance (7 fsync/s on one lvmraid1), but it may be related to the SATA controller. This may be a topic for a new issue, if you want. The controller is a Sil 3114, and there are reports on mailing lists about its low performance.

Basically, testing with dd oflag=direct writing 1 GB, I get 55 MB/s writing to a single disk on that controller, and 10 MB/s if raid1. The controller reports an NCQ queue depth of 0/32, while the other SATA controller in the system reports a queue depth of 31/32. Swapping a pair of disks between controllers clearly shows that the NCQ queue depth is tied to the controller (a quick way to check it is sketched below). But so far I haven't found a way to go over 10 MB/s on that raid1.

Thank you for the support.
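For reference, the per-disk queue depth can be read from sysfs; a minimal sketch (device names are examples):

for d in sda sdb sdc sdd; do
    printf '%s: ' "$d"
    cat /sys/block/$d/device/queue_depth
done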
Hello,

I have encountered a similar slow IOPS issue with an SSD.

For my case, I have benchmarked with FIO:
Case 1) SSD connected directly to a RAID card in HBA mode
Case 2) SSD connected directly to the motherboard

The IOPS difference is huge:
18000, 600 (R,W) IOPS for case 1
600, 220 (R,W) IOPS for case 2

The only difference I noticed is that the SSD does not have DPO/FUA support over plain SATA. It has DPO/FUA support when it is connected to the HBA (SATA over SAS).
(In reply to Patrick Dung from comment #52)
> I have encountered a similar slow IOPS issue with an SSD.
>
> For my case, I have benchmarked with FIO:
> Case 1) SSD connected directly to a RAID card in HBA mode
> Case 2) SSD connected directly to the motherboard
>
> The IOPS difference is huge:
> 18000, 600 (R,W) IOPS for case 1
> 600, 220 (R,W) IOPS for case 2
>
> The only difference I noticed is that the SSD does not have DPO/FUA support
> over plain SATA. It has DPO/FUA support when it is connected to the HBA
> (SATA over SAS).

Have you been able to address any HW issues and successfully raise IOPS?

Closing the bz; please reopen in case any non-HW issues occur.