Description of problem:

Running pg_test_fsync (the PostgreSQL utility to test filesystem fsync speed) I get unexpectedly slow results on LVM raid1. Specifically, I get between 1.5 and 3 fsync/s on a 1 TB LVM raid1 (-m1 on lvcreate, created one or two years ago) with ext4, activated and mounted at boot. Testing on a just-created small raid1 (also -m1) in a separate VG, I get ~105 fsync/s. If I reboot the system and test again, I get ~15 fsync/s on the very same filesystem. The same system has some mdraid1 devices, and they always work at >100 fsync/s.

--------------------

Version-Release number of selected component (if applicable):
Linux 4.4.6
Linux 3.18.29
lvm2 2.02.140

How reproducible:
Always.

Steps to Reproduce:
1. Create vg/lv (lvcreate -n testlv -m 1 -l 120 test)
2. mkfs.ext4 /dev/test/testlv
3. mount /dev/test/testlv /mnt/test
4. pg_test_fsync -f /mnt/test/testfile

Expected results (steps 1-4 on Linux 3.18):

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     107.786 ops/sec    9278 usecs/op
        fdatasync                         114.462 ops/sec    8737 usecs/op
        fsync                             104.928 ops/sec    9530 usecs/op
        fsync_writethrough                            n/a
        open_sync                         105.290 ops/sec    9498 usecs/op

Actual results (only steps 3-4 on Linux 3.18 after reboot; same for Linux 4.4.6):

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      13.828 ops/sec   72316 usecs/op
        fdatasync                          14.512 ops/sec   68907 usecs/op
        fsync                              12.502 ops/sec   79984 usecs/op
        fsync_writethrough                            n/a
        open_sync                          12.814 ops/sec   78040 usecs/op

My biggest problem is not with the testlv, but with another ext4 on a 1 TB LVM raid1, which reports the following for the same test. It is on faster hardware, but I really cannot destroy this array to recreate it for testing:

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                       9.424 ops/sec  106110 usecs/op
        fdatasync                          10.850 ops/sec   92167 usecs/op
        fsync                               3.347 ops/sec  298755 usecs/op
        fsync_writethrough                            n/a
        open_sync                           3.290 ops/sec  303942 usecs/op

Extra data:

# lvdisplay
  --- Logical volume ---
  LV Path                /dev/teraraid/teraraid_main
  LV Name                teraraid_main
  VG Name                teraraid
  LV UUID                lNtPhi-Ahrg-0utX-Gkhf-EcFy-asgM-EpLt0w
  LV Write Access        read/write
  LV Creation host, time atom, 2013-07-25 10:35:13 +0200
  LV Status              available
  # open                 1
  LV Size                931,50 GiB
  Current LE             238464
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:4

  --- Logical volume ---
  LV Path                /dev/test/testlv
  LV Name                testlv
  VG Name                test
  LV UUID                zCInyQ-WQGy-jGmd-TI3X-sTyO-9VOc-sUj81J
  LV Write Access        read/write
  LV Creation host, time atom, 2016-05-02 16:00:50 +0200
  LV Status              available
  # open                 1
  LV Size                480,00 MiB
  Current LE             120
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:9

# lvs
  LV            VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  teraraid_main teraraid Rwi-aor--- 931,50g                                   100,00
  testlv        test     rwi-aor--- 480,00m                                   100,00

[root@atom:/mnt]# dmsetup table
test-testlv: 0 983040 raid raid1 3 0 region_size 1024 2 254:5 254:6 254:7 254:8
teraraid-teraraid_main: 0 1953497088 raid raid1 3 0 region_size 1024 2 254:0 254:1 254:2 254:3
test-testlv_rimage_1: 0 983040 linear 8:22 10240
teraraid-teraraid_main_rimage_1: 0 1953497088 linear 8:33 10240
test-testlv_rimage_0: 0 983040 linear 8:2 10240
teraraid-teraraid_main_rimage_0: 0 1953497088 linear 8:49 10240
test-testlv_rmeta_1: 0 8192 linear 8:22 2048
teraraid-teraraid_main_rmeta_1: 0 8192 linear 8:33 2048
test-testlv_rmeta_0: 0 8192 linear 8:2 2048
teraraid-teraraid_main_rmeta_0: 0 8192 linear 8:49 2048

[root@atom:/mnt]# dmsetup status
test-testlv: 0 983040 raid raid1 2 AA 983040/983040 idle 0
teraraid-teraraid_main: 0 1953497088 raid raid1 2 AA 1953497088/1953497088 idle 0
test-testlv_rimage_1: 0 983040 linear
teraraid-teraraid_main_rimage_1: 0 1953497088 linear
test-testlv_rimage_0: 0 983040 linear
teraraid-teraraid_main_rimage_0: 0 1953497088 linear
test-testlv_rmeta_1: 0 8192 linear
teraraid-teraraid_main_rmeta_1: 0 8192 linear
test-testlv_rmeta_0: 0 8192 linear
teraraid-teraraid_main_rmeta_0: 0 8192 linear

# cat /proc/partitions
major minor  #blocks  name

   8        0  244198584 sda
   8        1   52428800 sda1
   8        2     524288 sda2
   8        3  104857600 sda3
   8        4   86386872 sda4
   8       16  312571224 sdb
   8       17   52428800 sdb1
   8       18          1 sdb2
   8       19  104857600 sdb3
   8       20   86386872 sdb4
   8       21     524288 sdb5
   8       22     524288 sdb6
   8       32  976762584 sdc
   8       33  976761543 sdc1
   8       48  976762584 sdd
   8       49  976761543 sdd1
   8       64  312571224 sde
   8       65  312570200 sde1
   9      127  312569040 md127
   9      126   86385776 md126
   9      125  104856504 md125
   9      124   52428736 md124
 254        0       4096 dm-0
 254        1  976748544 dm-1
 254        2       4096 dm-2
 254        3  976748544 dm-3
 254        4  976748544 dm-4
 254        5       4096 dm-5
 254        6     491520 dm-6
 254        7       4096 dm-7
 254        8     491520 dm-8
 254        9     491520 dm-9
In the original post I missed reporting that on 4.4.6 everything matches 3.18: >100 fsync/s if just created/formatted/mounted, and ~12 fsync/s if activated and mounted at boot. At some point "lvcreate -n testlv -m 1 -l 120 test" didn't work for me on 4.4.6, and after booting 3.18 it worked. But maybe it was some broken dm state from that moment, when I had been testing some mdraids. That lvcreate failure would be another report, and I can't reproduce it anymore.
More information: all sd devices report a speed of ~90 MB/s with "hdparm -t", and so do the RAID devices (md and LVM raid1). I have no complaints about bulk read speed... but the fsync/s differ enormously.
I also notice a degradation of performance on mdraid1 devices, depending on whether they are assembled at boot (~30 fsync/s) or just created (105 fsync/s). On Linux 4.4.6.
To give an idea of the fsync test of pg_test_fsync, here is an strace -tts:

1462206137.054947 fsync(3)              = 0
1462206137.333041 lseek(3, 0, SEEK_SET) = 0
1462206137.333223 write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192
1462206137.333560 fsync(3)              = 0
1462206137.612692 lseek(3, 0, SEEK_SET) = 0
1462206137.613141 write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192
1462206137.613552 fsync(3)              = 0
1462206137.904430 lseek(3, 0, SEEK_SET) = 0
1462206137.904980 write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192
1462206137.905273 fsync(3)              = 0
1462206138.183157 lseek(3, 0, SEEK_SET) = 0

It shows around 3 fsync/s (that's the 1 TB teraraid ext4 created in 2013), the slowest in the system.
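As a rough illustration, here is a minimal shell sketch of what that loop does (one 8 kB write followed by fsync, repeated). This is my approximation, not pg_test_fsync itself: pg_test_fsync keeps the same fd open and seeks back, while this reopens the file each iteration. Paths are examples; GNU coreutils assumed:

n=30
t0=$(date +%s.%N)
for i in $(seq $n); do
    # one 8 kB write, then fsync of the file data (conv=fsync)
    dd if=/dev/zero of=/mnt/test/testfile bs=8k count=1 \
       conv=fsync,notrunc status=none
done
t1=$(date +%s.%N)
echo "approx fsync ops/sec: $(echo "$n / ($t1 - $t0)" | bc -l)"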
I just tested the I/O schedulers for the two teraraid disks (sdc/sdd):

noop = much slower: 1.2 fsync/s
deadline = ~3.5 fsync/s
cfq (the default) = ~3.5 fsync/s
Created attachment 1153010 [details]
blktrace of slowest raid1 + 2 pvs

When pg_test_fsync started the fsync() test, I ran blktrace on the three devices (the raid + the 2 PVs). I ctrl-c'ed blktrace before the fsync test finished; it might have run for around 3 seconds.
Created attachment 1153036 [details]
blktrace of fsync test, just created/formatted/mounted

I just created the testlv, and the performance is OK: 96 fsync/s, about 100. That's the performance I would like to see always.
Created attachment 1153037 [details]
blktrace of fsync test, just rebooted and mounted again

I umounted the ext4 from the previous attachment and rebooted the system. After boot, I checked that there was no I/O activity on the drives, mounted it again, ran the test, and got 11 fsync/s: a 10x drop in performance. The trace, like before, covers three block devices: the raid and the two PVs.
kabi helped on IRC getting more information. After reboot (~15 fsync/s), if I enable writemostly and then disable it on every PV of the raid1, plus an "lvchange --refresh", I get back to 100 fsync/s. I tried the same on the old 1 TB raid1, and it rose from 3.5 fsync/s to 17 fsync/s. Still very low, but a big improvement. kabi suspects "some setting incorrectly set in the dm_raid wrapper which gets 'magically' fixed when manipulating writemostly" (all of this tested on Linux 4.4.6).
I have to note that if I enabled writemostly on the mirror (not the origin), the benchmark didn't improve until I ran "lvchange --refresh" on the LV. Disabling writemostly + refresh made it drop back down to slow fsync/s. If I enabled writemostly on the origin (not the mirror), "lvchange --refresh" was not needed and the fsync/s were already high. And disabling that writemostly later did not make the fsync/s drop back down.
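For reference, a minimal sketch of the toggle sequence described above, assuming the standard "lvchange --writemostly PV[:y|n] VG/LV" syntax (the PV and LV names are the ones from this report; adjust to your setup):

# mark one leg writemostly and refresh...
lvchange --writemostly /dev/sdc1:y teraraid/teraraid_main
lvchange --refresh teraraid/teraraid_main
# ...then clear the flag and refresh again
lvchange --writemostly /dev/sdc1:n teraraid/teraraid_main
lvchange --refresh teraraid/teraraid_main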
To make sure this _is_ raid1 related: do you see the same drop on a linear LV?
Does "lvchange --refresh ..." on the raid1 LV even help w/o writemostly?
No, lvchange --refresh does not help after boot, without writemostly. I will test the linear LV just-created and after-reboot.
I tested your raid1+ext4+pg_test_fsync on a Fedora 23 VM (kernel 4.4.7-300.fc23.x86_64, postgresql-contrib 9.4.7-1.fc23) on both SSD and iSCSI based host backing store and did not see such variance. Let's find out what is going on with your disks:

- anything related in the kernel log?
- in case they support SMART: anything "smartctl -a .." unveils on them?
I didn't think that it could be the disks, because I can:

1) boot the system, and test that it runs slow in the testlv raid with ext4
2) destroy it, create it anew, format ext4
3) test the speed, and it runs fast.

But if I boot the system again, and run the test, it runs slow again.

I really could not see anything special in the dmesg. I will check smart, but all the disks work at 90MB/s without trouble with "hdparm -t".
(In reply to viric from comment #15)
> I didn't think that it could be the disks, because I can:
>
> 1) boot the system, and test that it runs slow in the testlv raid with ext4
> 2) destroy it, create it anew, format ext4
> 3) test the speed, and it runs fast.
>
> But if I boot the system again, and run the test, it runs slow again.
>
> I really could not see anything special in the dmesg. I will check smart,
> but all the disks work at 90MB/s without trouble with "hdparm -t".

We need to walk up the stack and make sure that all components underneath pg_test_fsync are sane.
Just to rule out userland, I can supply an strace -T:

...
fsync(3)                = 0 <0.282584>
lseek(3, 0, SEEK_SET)   = 0 <0.000080>
write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192 <0.000127>
fsync(3)                = 0 <0.599434>
lseek(3, 0, SEEK_SET)   = 0 <0.000045>
write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192 <0.000134>
fsync(3)                = 0 <0.274169>
lseek(3, 0, SEEK_SET)   = 0 <0.000199>
write(3, "\v\341\32\34\177#\370)\370\244\33\23\265\312N\350\23028\340yM=4\274_Nw\372\313l\5"..., 8192) = 8192 <0.000281>
fsync(3)                = 0 <0.273742>
...

So, the time is definitely spent in fsync(). (This is on ext4 over teraraid_main.)
Here are the smart attributes of sdc and sdd (those of the slowest raid), 1TB each. Health: PASSED and no error log in both.

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   099   006    Pre-fail  Always       -       56640040
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       13653903
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       24284
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       15
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   061   045    Old_age   Always       -       31 (Min/Max 28/32)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   087   087   000    Old_age   Always       -       26872
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1412h+10m+07.992s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       94230993838
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       277241169063

[root@atom:~]# smartctl -A /dev/sdc
smartctl 6.4 2015-11-07 r4167 [x86_64-linux-4.4.6] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   099   006    Pre-fail  Always       -       46101008
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       13179346
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       24363
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       15
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   063   045    Old_age   Always       -       28 (Min/Max 26/28)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1149958
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10778h+17m+46.319s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       116734019346
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       164927456422
Got to analyzing your blktraces from comments 6-8; neither of them shows concerning delays. The 95 fsync/s blktrace even shows more disk delays over 20ms than your 11 fsync/s one:

--- blktrace-lvmraid1-sda2-sdb6-11fync_per_s-afterboot.txt ---
Line 2877 shows delay=0.087836 -> 8,22 0 33 0.279400236 0 C WS 489453568 [0]
Line 24985 shows delay=0.020364 -> 8,22 0 282 1.886307724 0 C WS 0 [0]
Processed 28493 lines

--- blktrace-lvmraid1-sda2-sdb6-95fync_per_s.txt ---
Line 27761 shows delay=0.025947 -> 8,2 2 198 0.890765808 182 P N [md126_raid1]
Line 28749 shows delay=0.046491 -> 8,2 3 271 0.949888466 12436 C WS 0 [0]
Line 29189 shows delay=0.044012 -> 8,22 1 448 1.003416511 0 C WS 0 [0]
Line 34453 shows delay=0.025944 -> 8,22 0 605 1.257287095 12434 C WS 0 [0]
Line 49493 shows delay=0.022982 -> 8,22 2 441 1.771404906 12436 C WS 0 [0]
Line 68477 shows delay=0.022528 -> 8,22 2 737 2.381291379 0 C WS 0 [0]
Line 71669 shows delay=0.037130 -> 8,22 3 1378 2.530514890 12666 C WS 0 [0]
Line 72273 shows delay=0.026166 -> 8,22 2 766 2.578556228 0 C WS 0 [0]
Line 76149 shows delay=0.025658 -> 8,22 0 825 2.731376221 12441 C WS 0 [0]
Processed 80777 lines

--- blktrace-raid-sdc-sdd-fsync.txt ---
Line 10921 shows delay=0.289064 -> 8,32 3 162 1.539932025 0 C WS 0 [0]
No major/minor found
Processed 19237 lines

You mentioned that you did not let blktrace run to the end, so the cause may not be in the trace.
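In case you want to scan for such gaps yourself, here is a minimal sketch of how a delay listing like the above can be produced from a blkparse text dump (assumes the default blkparse output with the timestamp in the 4th column; the trace basename is an example):

blkparse -i sda2 | awk '
    $4 ~ /^[0-9]/ {
        if (prev != "" && $4 - prev > 0.020)
            printf "delay=%.6f -> %s\n", $4 - prev, $0
        prev = $4
    }'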
Making testlv (sda2+sdb6) a linear LV and testing the fsyncs there, I get 116 fsync/s after creation+format, and the same after boot+mount (117 fsync/s).

Sorry, about blktrace I don't understand... can I supply something more useful? I only ran blktrace during the fsync phase of the test, to keep the file smaller.
(In reply to viric from comment #20)
> Making testlv (sda2+sdb6) a linear LV and testing the fsyncs there, I get
> 116 fsync/s after creation+format, and the same after boot+mount (117 fsync/s).

So you don't see the drop after reboot on a linear LV?

> Sorry, about blktrace I don't understand... can I supply something more
> useful? I only ran blktrace during the fsync phase of the test, to keep the
> file smaller.

It doesn't matter that the traces don't cover the complete fsync phase. As mentioned in my comment #19, blktrace does not unveil a reason in the block layer for the factor-10 drop you report.
(In reply to Heinz Mauelshagen from comment #21)
> (In reply to viric from comment #20)
> So you don't see the drop after reboot on a linear LV?

Exactly. No drop.

As for blktrace, I understand that you don't need more traces from me.
Notice that in comment #3 I report the same phenomenon on mdraid: 100 fsync/s (just created+formatted+mounted) dropping to 30 fsync/s (reboot+mount).
What are the results if you stripe across your 2 PVs (lvcreate -L1T -I4k ...)? No drops after reboot either?
In the test volume, I only have 254 extents.

# lvcreate -n testlv -I4k -i 2 -l 240 test
  Logical volume "testlv" already exists in volume group "test"
# mkfs.ext4 /dev/test/testlv
...
# mount /dev/test/testlv raidtest/
# pg_test_fsync -f raidtest/prova
...
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     118.950 ops/sec    8407 usecs/op
        fdatasync                         115.677 ops/sec    8645 usecs/op
        fsync                             111.675 ops/sec    8955 usecs/op

----------
Then I reboot:

# mount /dev/test/testlv raidtest/
# pg_test_fsync -f raidtest/prova
        open_datasync                     116.382 ops/sec    8592 usecs/op
        fdatasync                         119.655 ops/sec    8357 usecs/op
        fsync                              43.608 ops/sec   22932 usecs/op

(I ran the commands above more than once, and the numbers are reliably those. I also checked with iostat that there was no other I/O activity.)
(In reply to viric from comment #25)
> In the test volume, I only have 254 extents.
>
> # lvcreate -n testlv -I4k -i 2 -l 240 test
>   Logical volume "testlv" already exists in volume group "test"
  ^^^ failed to create testlv as striped, please remove and repeat.
Oh sorry, I just noticed that lvcreate failed. The testlv there was the result of:

# lvcreate -n testlv -I4k -l 240 test
  Ignoring stripesize argument with single stripe.
WARNING: ext4 signature detected on /dev/test/testlv at offset 1080. Wipe it? [y/n]: ^C
  Interrupted...
  Aborted wiping of ext4.
  1 existing signature left on the device.
  Logical volume "testlv" created.
[root@atom:/mnt]# lvcreate -n testlv -I4k -i 2 -l 240 test
  Logical volume "testlv" already exists in volume group "test"

(So, without "-i 2", which I don't know if it was required.) Let me test again:

-----------------------
[root@atom:/mnt]# lvcreate -n testlv -I4k -i 2 -l 240 test
WARNING: ext4 signature detected on /dev/test/testlv at offset 1080. Wipe it? [y/n]: y
  Wiping ext4 signature on /dev/test/testlv.
  Logical volume "testlv" created.
[root@atom:/mnt]# mkfs.ext4 /dev/test/testlv
...
[root@atom:/mnt]# mount /dev/test/testlv raidtest/
[root@atom:/mnt]# pg_test_fsync -f raidtest/prova

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     112.690 ops/sec    8874 usecs/op
        fdatasync                         118.654 ops/sec    8428 usecs/op
        fsync                              29.753 ops/sec   33610 usecs/op
        fsync_writethrough                            n/a
        open_sync                          27.871 ops/sec   35880 usecs/op

So, in this case, it has slow fsyncs even without a reboot (checked more than once; reliable results).

--------------
After reboot:

[root@atom:/mnt]# mount /dev/test/testlv raidtest/
[root@atom:/mnt]# pg_test_fsync -f raidtest/prova

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     119.843 ops/sec    8344 usecs/op
        fdatasync                         119.526 ops/sec    8366 usecs/op
        fsync                              30.489 ops/sec   32798 usecs/op
        fsync_writethrough                            n/a
        open_sync                          29.908 ops/sec   33436 usecs/op

(Reliable results; test run more than once.)
I think we've ruled out any HW bottlenecks with those results. As mentioned, I'm not able to reproduce your results with the 4.4.7 Fedora 23 kernel here, which performs well after reboot in the pg_test_fsync runs. Try whether a newer kernel works for you.
I have just installed 4.4.8. I had created the raid1 testlv before booting into it.

Linux atom 4.4.8 #1-NixOS SMP Wed Apr 20 06:44:02 UTC 2016 x86_64 GNU/Linux

After reboot, on the raid1 testlv:

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      14.915 ops/sec   67045 usecs/op
        fdatasync                          14.534 ops/sec   68802 usecs/op
        fsync                              12.385 ops/sec   80743 usecs/op
        fsync_writethrough                            n/a
        open_sync                          10.391 ops/sec   96241 usecs/op

If I destroy it and create it again (as in comment #1), I'm back at 100 fsync/s. Exactly the same as 4.4.6.
Remember that I could reproduce it in 3.18 too.
Yes, likely something specific to your configuration unless we can come up with a reproducer here.
I have no idea what to look at. If you have ideas, I can provide the data. I will try to reproduce it on a new system.
I have created an LVM raid1 on a new system (PVs sda2 and sdb2), about 500 MB big, running Linux 4.4.7 x86_64. I see, just after creation, 40 fsync/s. If I make it linear, 110 fsync/s.

I notice that I can read it with dd at 100 MB/s. But writing (ctrl-c after some seconds):

# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8K
dd: error writing '/dev/test/testlv': No space left on device
61441+0 records in
61440+0 records out
503316480 bytes (503 MB, 480 MiB) copied, 45.43 s, 11.1 MB/s

I didn't try rebooting, because it runs at awful speed even just after creation on this system.
No new ideas here so far. With an 8K block size, writing will be inefficient anyway; use a larger block size.
OK. I create the LVM raid1 and run dd oflag=direct bs=5M: 44 MB/s. I reboot and run the same dd: 37.6 MB/s. Both results seem close, but the dd definitely reports exactly 33.6 MB/s, in 3 tests in a row.

If I mkfs.ext4 on that testlv and mount it (I created the LV in the previous boot, not this one), I get 15 fsync/s as before. So it's not ext4.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      14.150 ops/sec   70671 usecs/op
        fdatasync                          14.922 ops/sec   67014 usecs/op
        fsync                              12.943 ops/sec   77262 usecs/op
        fsync_writethrough                            n/a
        open_sync                          11.788 ops/sec   84833 usecs/op
Just tested the conventional LVM raid1 on that 4.4.7 system of comment #33.

Speed just created+format+mount: 40 fsync/s (as reported in comment #33).
Reboot, mount and test: 3.9 fsync/s.

So I can reproduce it on a 2nd system with completely different hardware (all spinning disks, though).
Continuing from the previous comment: now that I have the raid running slow (the after-boot situation), I run the dd:

# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=1000
1000+0 records in
1000+0 records out
8192000 bytes (8.2 MB, 7.8 MiB) copied, 96.0817 s, 85.3 kB/s

Something is pretty wrong. No other I/O load on any disk of the computer; I checked. Any run of the dd gives 80-90 kB/s. The same dd on a just-created raid1 (without a reboot in between) gave 11 MB/s.
Created attachment 1153591 [details]
blktrace in system2 of a dd direct bs=8k that runs at 89kB/s on the lvm raid1

As I was in the after-boot situation of the very slow LVM raid1, I ran the dd with blktrace started before it and stopped after it. I hope it helps.

[root@comanegra:~]# blktrace -D /dev/shm/ /dev/sda2 /dev/sdb2 /dev/test/testlv &
[1] 3989
[root@comanegra:~]# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 9.30974 s, 88.0 kB/s
[root@comanegra:~]# fg
blktrace -D /dev/shm/ /dev/sda2 /dev/sdb2 /dev/test/testlv
^C=== sda2 ===
 CPU  0:    910 events,   43 KiB data
 CPU  1:   2151 events,  101 KiB data
 CPU  2:   1191 events,   56 KiB data
 CPU  3:   5838 events,  274 KiB data
 Total:   10090 events (dropped 0),  474 KiB data
=== sdb2 ===
 CPU  0:    703 events,   33 KiB data
 CPU  1:   2151 events,  101 KiB data
 CPU  2:   1185 events,   56 KiB data
 CPU  3:   5362 events,  252 KiB data
 Total:    9401 events (dropped 0),  441 KiB data
=== dm-9 ===
 CPU  0:     31 events,    2 KiB data
 CPU  1:     29 events,    2 KiB data
 CPU  2:     73 events,    4 KiB data
 CPU  3:    168 events,    8 KiB data
 Total:     301 events (dropped 0),   15 KiB data
Just to add that it only depends on recreating the raid1... starting from the previous slow situation, see the following commands:

[root@comanegra:~]# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 9.30974 s, 88.0 kB/s

[root@comanegra:~]# systemctl suspend

***** I WAKE IT FROM SUSPEND *****

[root@comanegra:~]# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 9.35214 s, 87.6 kB/s

[root@comanegra:~]# lvremove test/testlv
Do you really want to remove active logical volume testlv? [y/n]: y
  Logical volume "testlv" successfully removed
[root@comanegra:~]# lvcreate -n testlv -l 120 -m1 test
  allocation/use_blkid_wiping=1 configuration setting is set while LVM is not compiled with blkid wiping support.
  Falling back to native LVM signature detection.
  Logical volume "testlv" created.
[root@comanegra:~]# dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 0.0721038 s, 11.4 MB/s
Not able to reproduce your slow dd results here:

# perf record dd if=/dev/zero of=/dev/tst/r oflag=direct ibs=8K iflag=fullblock obs=8K count=100
100+0 records in
100+0 records out
819200 bytes (819 kB) copied, 0.0584847 s, 14.0 MB/s

Report:
    14.29%  dd  [kernel.vmlinux]  [k] _clear_user
    14.29%  dd  [kernel.vmlinux]  [k] get_user_pages_fast
     7.14%  dd  [kernel.vmlinux]  [k] __kmalloc
     7.14%  dd  [kernel.vmlinux]  [k] bio_add_page
     7.14%  dd  [kernel.vmlinux]  [k] entry_SYSCALL_64_fastpath
     7.14%  dd  [kernel.vmlinux]  [k] finish_task_switch
     7.14%  dd  [kernel.vmlinux]  [k] get_page_from_freelist
     7.14%  dd  [md_mod]          [k] bitmap_get_counter
     7.14%  dd  [raid1]           [k] raid1_make_request
     7.14%  dd  libc-2.22.so      [.] __GI___libc_write
     7.14%  dd  libc-2.22.so      [.] __fprintf_chk
     7.14%  dd  libc-2.22.so      [.] _dl_addr
# perf record dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100
100+0 records in
100+0 records out
819200 bytes (819 kB, 800 KiB) copied, 9.59411 s, 85.4 kB/s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (78 samples) ]

# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ................................
    24.66%  dd  libc-2.23.so       [.] __strcmp_sse2
     5.51%  dd  [kernel.kallsyms]  [k] dequeue_task_fair
     4.58%  dd  [kernel.kallsyms]  [k] __clear_user
     3.24%  dd  [kernel.kallsyms]  [k] find_busiest_group
     3.18%  dd  libc-2.23.so       [.] add_module.isra.1
     2.84%  dd  libc-2.23.so       [.] __tsearch
     2.83%  dd  [kernel.kallsyms]  [k] unlock_page
     2.76%  dd  [kernel.kallsyms]  [k] syscall_trace_enter_phase1
     2.71%  dd  [kernel.kallsyms]  [k] copy_page
     2.51%  dd  libc-2.23.so       [.] __GI___libc_write
     2.45%  dd  [kernel.kallsyms]  [k] ___slab_alloc
     2.45%  dd  [kernel.kallsyms]  [k] idle_cpu
     2.45%  dd  [md_mod]           [k] 0x000000000000653d
     2.24%  dd  [kernel.kallsyms]  [k] mempool_free
     2.07%  dd  [kernel.kallsyms]  [k] __msecs_to_jiffies
     2.07%  dd  [kernel.kallsyms]  [k] prepare_to_wait_event
     1.95%  dd  libc-2.23.so       [.] __memcmp_sse2
     1.70%  dd  libc-2.23.so       [.] __gconv_read_conf
     1.57%  dd  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     1.54%  dd  [kernel.kallsyms]  [k] __wake_up_common
     1.54%  dd  [kernel.kallsyms]  [k] bio_advance
     1.54%  dd  [kernel.kallsyms]  [k] dio_bio_complete
     1.54%  dd  [kernel.kallsyms]  [k] mempool_alloc
     1.54%  dd  [kernel.kallsyms]  [k] resched_curr
     1.52%  dd  [kernel.kallsyms]  [k] _raw_spin_lock
     1.52%  dd  [kernel.kallsyms]  [k] _raw_write_unlock_irqrestore
     1.52%  dd  [kernel.kallsyms]  [k] cmpxchg_double_slab.isra.61
     1.52%  dd  [kernel.kallsyms]  [k] schedule
     1.52%  dd  [md_mod]           [k] 0x0000000000010925
     1.52%  dd  ld-2.23.so         [.] check_match
     1.52%  dd  libc-2.23.so       [.] __memcpy_sse2
     1.52%  dd  libc-2.23.so       [.] memchr
     1.50%  dd  [kernel.kallsyms]  [k] __do_page_fault
     1.47%  dd  libc-2.23.so       [.] malloc
     1.38%  dd  [kernel.kallsyms]  [k] __mnt_is_readonly.part.8
     0.71%  dd  [kernel.kallsyms]  [k] __blockdev_direct_IO
     0.59%  dd  [kernel.kallsyms]  [k] page_fault
     0.57%  dd  [kernel.kallsyms]  [k] schedule_timeout
     0.06%  dd  [kernel.kallsyms]  [k] blk_check_plugged
     0.05%  dd  [kernel.kallsyms]  [k] clear_buddies
     0.05%  dd  [kernel.kallsyms]  [k] do_close_on_exec
     0.01%  dd  [kernel.kallsyms]  [k] native_write_msr_safe
     0.00%  perf [kernel.kallsyms] [k] native_write_msr_safe
Now the same with "perf record -a":

# Total Lost Samples: 0
#
# Samples: 11K of event 'cycles'
# Event count (approx.): 665454046
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ................................
    76.24%  swapper  [kernel.kallsyms]  [k] native_safe_halt
     0.48%  dd       libc-2.23.so       [.] __strcmp_sse2
     0.44%  swapper  [kernel.kallsyms]  [k] ahci_handle_port_interrupt
     0.33%  swapper  [kernel.kallsyms]  [k] __schedule
     0.29%  swapper  [kernel.kallsyms]  [k] update_wall_time
     0.29%  swapper  [kernel.kallsyms]  [k] read_tsc
     0.23%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock
     0.23%  swapper  [kernel.kallsyms]  [k] __switch_to
Created attachment 1153915 [details]
perf stat -B in before/after recreation
Created attachment 1153925 [details]
perf stat -a of dd before/after recreation, and of sleep 9

I add a perf stat -a (all CPUs), and also one of "sleep 9" (more or less the duration of the slow dd). The fast dd manages to do the job in 1613 context switches, "sleep 9" takes 2558 context switches (few other processes in the system), and the slow dd takes 12105 context switches. That's the biggest difference I see.
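For anyone reproducing that comparison, a minimal sketch (the device path is the one from this report):

# count context switches for a single dd run in isolation
perf stat -e context-switches,cpu-migrations \
    dd if=/dev/zero of=/dev/test/testlv oflag=direct bs=8k count=100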
I noticed this about immediate safe mode at boot. Maybe it is relevant?

# dmesg | grep -e 'md:\|logical'
[    1.662122] md: raid6 personality registered for level 6
[    1.662124] md: raid5 personality registered for level 5
[    1.662125] md: raid4 personality registered for level 4
[    1.695356] sd 2:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
[    1.695379] sd 1:0:0:0: [sda] 1465149168 512-byte logical blocks: (750 GB/699 GiB)
[    2.039841] md: raid1 personality registered for level 1
[    2.211842] stage-1-init: 1 logical volume(s) in volume group "raid2" now active
[    2.385094] stage-1-init: 1 logical volume(s) in volume group "test" now active
[    2.710032] md: mdX in immediate safe mode
[    2.710176] md: mdX in immediate safe mode
Some findings with kabi on IRC...

1) In the distro we have "vgchange -ay" at boot. It activated my rootfs (raid2) and test/testlv.
2) If I change that to "vgchange -ay raid2", testlv is not activated at boot, and if I activate it manually after boot, it runs fast. To be noted, "raid2" also runs fast.
3) In the distro we don't have a clean shutdown that deactivates the arrays. That may be the source of all this.

Nevertheless... what makes an array go back to normal mode? I tried triggering a check:

[  234.879970] md: data-check of RAID array mdX
[  234.879980] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[  234.879986] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[  234.879994] md: using 128k window, over a total of 491520k.
[  240.841787] md: mdX: data-check done.

Yet the array still performs slow after it. The only way I found to reliably make testlv run fast (normal mode, I guess) is to deactivate it before shutdown, and reboot with "vgchange -ay raid2" (so, not activating it), then activating it manually (sketched below). Maybe we are missing something in the initrd? But why does raid2 (the rootfs) work fine, despite showing the same "immediate safe mode" message?
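A sketch of that only-reliable sequence, assuming the VG names above:

vgchange -an test     # deactivate the test VG before shutdown
# reboot; the boot scripts run only "vgchange -ay raid2"
vgchange -ay test     # activate manually after boot -> fast fsync/s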
Isn't this "immediate safe mode", in fact, a mode set on the array at shutdown, so that you aren't required to deactivate it? That's what I read from this 2012 blog post: http://neil.brown.name/blog/20120615073245

If so, how do I set that mode at shutdown, so the array looks perfectly safe at boot, as if it had been deactivated at shutdown?
This immediate safe mode seems to hit every RAID in our system, not only LVM raid. On the first box I reported about, where mdraid and lvmraid live together, we get this at boot:

[    5.399844] md: md127 in immediate safe mode
[    5.400316] md: md126 in immediate safe mode
[    5.400439] md: md125 in immediate safe mode
[    5.400677] md: md123 in immediate safe mode
[    5.400828] md: md124 in immediate safe mode
[    5.401565] md: mdX in immediate safe mode

No idea what should bring those arrays out of safe mode, or what we are doing special in the distro that makes them go into that state at boot. We reproduce it on every system, even in VMs.
OK, we solved the "immediate safe mode" at boot, in the distro. We were running a "pkill -9 -v -f @" to ensure that we were not leaving any unwanted process in the initrd... and that was signaling kernel thread PIDs, and if you signal the md kernel thread, it puts the RAID into safe mode. That behavior was intended for the big kill in the common shutdown sequences of distros, I guess.
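A minimal sketch of a safer variant, assuming kernel threads can be recognized by their empty /proc/PID/cmdline (this is the concept; our actual initrd fix may differ in detail):

for pid in $(pgrep -v -f @); do
    # kernel threads have an empty /proc/PID/cmdline; never signal them
    [ -s /proc/$pid/cmdline ] && kill -9 $pid
done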
So has that made all the other problems go away?
This definitely cleared any difference between just-booted and just-created arrays, both mdraid and lvmraid.

Now I'm hitting another issue on the same box, about low fsync performance (7 fsync/s on one lvmraid1), but it may be related to the SATA controller. This may be a topic for a new issue, if you want. The controller is a Sil 3114, and there are reports on mailing lists about its low performance.

Basically, testing with dd oflag=direct writing 1 GB, I get 55 MB/s writing to a single disk on that controller, and 10 MB/s if raid1. The controller reports an NCQ queue depth of 0/32, while the other SATA controller in the system reports a queue depth of 31/32. Swapping a pair of disks between controllers clearly shows that the NCQ queue depth is tied to the controller (a quick way to check it is sketched below). But so far I haven't found a way to go over 10 MB/s on that raid1.

Thank you for the support.
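For reference, the per-disk queue depth can be read from sysfs; a minimal sketch (device names are examples):

for d in sda sdb sdc sdd; do
    printf '%s: ' "$d"
    cat /sys/block/$d/device/queue_depth
done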
Hello,

I have encountered a similar slow IOPS issue with an SSD.

For my case, I have benchmarked with FIO:
Case 1) SSD connected directly to a RAID card in HBA mode
Case 2) SSD connected directly to the motherboard

The IOPS difference is huge:
18000, 600 (R,W) IOPS for case 1
600, 220 (R,W) IOPS for case 2

The only difference I noticed is that the SSD does not have DPO/FUA support over plain SATA. It has DPO/FUA support when it is connected to the HBA (SATA over SAS).
(In reply to Patrick Dung from comment #52)
> I have encountered a similar slow IOPS issue with an SSD.
>
> For my case, I have benchmarked with FIO:
> Case 1) SSD connected directly to a RAID card in HBA mode
> Case 2) SSD connected directly to the motherboard
>
> The IOPS difference is huge:
> 18000, 600 (R,W) IOPS for case 1
> 600, 220 (R,W) IOPS for case 2
>
> The only difference I noticed is that the SSD does not have DPO/FUA support
> over plain SATA. It has DPO/FUA support when it is connected to the HBA
> (SATA over SAS).

Have you been able to address any HW issues and successfully raise IOPS?

Closing the bz; please reopen in case any non-HW issues occur.