Description of problem:
==================
Creation of files on a hot tier volume is taking a very long time.

Version-Release number of selected component (if applicable):
==============
glusterfs-server-3.7.5-6

How reproducible:

Steps to Reproduce:
=========
1. Created a distributed-replicate volume and then attached 4 hot tier bricks.
2. Mounted the volume on a client using FUSE and created 10k files using the following script; it took 17 min.

for i in {1..10000}
do
echo $i >> /mnt/disrep_tier/dht/file$i
done

3. Using the same bricks (mount), created a distributed-replicate volume; creation of 10k files took 1 min.

Actual results:

Expected results:
========
Creation of files on a tier volume should not take this much time.

Additional info:
==========
[root@rhs-client18 dht]# gluster vol info disrep_tier

Volume Name: disrep_tier
Type: Tier
Volume ID: ea4bd2c2-efd3-4d25-bbc1-8f6d9c75dafc
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick5/tier
Brick2: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick5/tier
Brick3: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick6/tier
Brick4: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick6/tier
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick5: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick7/disrep_teri
Brick6: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick7/disrep_teri
Brick7: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick6/disrep_teri
Brick8: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick6/disrep_teri
Options Reconfigured:
cluster.tier-demote-frequency: 600
performance.readdir-ahead: on
features.ctr-enabled: on

Normal volume
-------------------
[root@rhs-client19 ~]# gluster vol info afr2x2_test

Volume Name: afr2x2_test
Type: Distribute
Volume ID: 8fcada3a-d6ab-4d62-ac89-dbd7b360c498
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick5/afr2x2_test
Brick2: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick5/afr2x2_test
Brick3: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick6/afr2x2_test
Brick4: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick6/afr2x2_test
Options Reconfigured:
performance.readdir-ahead: on
[root@rhs-client19 ~]#
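For reference, the reproduction above can be timed in one shot. This is a minimal sketch using the script body and mount point quoted in the description; the time wrapper itself is an addition for illustration and was not part of the original run:

time bash -c 'for i in {1..10000}; do echo $i >> /mnt/disrep_tier/dht/file$i; done'

Running the same command against the mount point of the plain distributed-replicate volume gives the comparison figure.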
cluster.lookup-optimize must be set to "on" for tiered volumes to avoid this problem. The optimization will work for new directories created after the tier is attached. For existing directories, including root, a rebalance is required before the tier is attached.

A future optimization could do a few things for tiered volumes to make the process easier:

1. turn on lookup-optimize automatically
2. remove the requirement to run rebalance on pre-existing directories
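A minimal sketch of the manual workaround described in the first paragraph above, assuming it is applied to the base volume from this report (disrep_tier) before the tier is attached; both commands are standard gluster CLI:

# enable the optimization on the volume
gluster volume set disrep_tier cluster.lookup-optimize on

# for directories that already exist (including root), rebalance before attach-tier
gluster volume rebalance disrep_tier start
gluster volume rebalance disrep_tier status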
Recommend we tell customers to use the switch described in comment #3, and defer the future optimizations until later.
Set the cluster.lookup-optimize option on the volume and ran the script to create 10k files; the script took 34m.

[root@dht-rhs-23 dht_retest]# time /mnt/disrep_tier/dht_retest/create.sh

real	34m12.295s
user	0m1.874s
sys	0m5.472s

[root@dht-rhs-23 dht_retest]# cat /mnt/disrep_tier/dht_retest/create.sh
for i in {1..10000}
do
echo $i >> /mnt/disrep_tier/dht_retest/file$i
done

[root@dht-rhs-23 dht_retest]# mount | grep /mnt/disrep_tier
rhs-client18.lab.eng.blr.redhat.com:/disrep_tier on /mnt/disrep_tier type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@dht-rhs-23 dht_retest]#

[root@rhs-client18 core]# gluster vol info disrep_tier

Volume Name: disrep_tier
Type: Tier
Volume ID: ea4bd2c2-efd3-4d25-bbc1-8f6d9c75dafc
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick5/tier
Brick2: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick5/tier
Brick3: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick6/tier
Brick4: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick6/tier
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick5: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick7/disrep_teri
Brick6: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick7/disrep_teri
Brick7: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick6/disrep_teri
Brick8: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick6/disrep_teri
Options Reconfigured:
cluster.lookup-optimize: on
cluster.tier-mode: test
cluster.tier-demote-frequency: 600
performance.readdir-ahead: on
features.ctr-enabled: on
Can you post /mnt/disrep_tier/dht_retest/create.sh?
I am using the following script:

[root@dht-rhs-23 dht_retest]# cat /mnt/disrep_tier/dht_retest/create.sh
for i in {1..10000}
do
echo $i >> /mnt/disrep_tier/dht_retest/file$i
done
Can you tell me in what order you did the following steps:
- attach tier
- set lookup optimization
- create directories disrep_tier/dht_retest
Performed my tests in the following order:
1. Attach tier (no data present before attaching the tier)
2. Set lookup optimization
3. Created a new directory and created files in that directory
Would you please repeat the same create.sh against a 2x4 (8 brick) distributed replica volume and post the times?
Tested with a distributed replica (2x4) volume; file creation took 42 seconds.

[root@vertigo data]# time ./create.sh

real	0m42.844s
user	0m0.630s
sys	0m1.431s

[root@vertigo data]# mount | grep dht_bug
rhs-client18.lab.eng.blr.redhat.com:dht_bug on /mnt/dht_bug type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@rhs-client18 test_delete-tier-dht]# gluster vol info dht_bug

Volume Name: dht_bug
Type: Distributed-Replicate
Volume ID: cda73bf1-641c-407c-8c7d-9853ac24e2fb
Status: Created
Number of Bricks: 2 x 4 = 8
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick1/dht_bug
Brick2: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick1/dht_bug
Brick3: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick2/dht_bug
Brick4: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick2/dht_bug
Brick5: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick3/dht_bug
Brick6: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick3/dht_bug
Brick7: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick4/dht_bug
Brick8: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick4/dht_bug
Options Reconfigured:
performance.readdir-ahead: on

[root@rhs-client18 test_delete-tier-dht]# gluster vol start dht_bug
volume start: dht_bug: success
My test and observations...

Test:
1. create volume. all options at default, except client.event-threads=4, server.event-threads=4, lookup-optimize=on.
2. mount on client (/mnt/glustervol)
3. mkdir /mnt/glustervol/test_dir
4. time for ii in `seq 100000`; do echo $ii > /mnt/glustervol/test_dir/f${ii}; done

For 2x(8+4) SAS-HDD-JBOD volume:
real	6m58.361s
user	0m4.346s
sys	0m5.869s

For 2x2 SAS-SSD-JBOD volume:
real	4m31.346s
user	0m3.931s
sys	0m5.534s

For tiered volume:
a. create 2x(8+4) base volume, with tunable parameters set as before, i.e. event-threads and lookup-optimize.
b. attach 2x2 SAS-SSD-JBOD bricks as cache tier to the empty base volume
c. set: features.ctr-enabled on, cluster.tier-mode cache
Repeat steps 2, 3, 4 above.

real	10m31.760s
user	0m6.752s
sys	0m6.457s

Additional info: repeated the tiered volume run with the following additional tuning:
RHS_TIER_RC="on"    # features.record-counters
RHS_TIER_WFT=32     # cluster.write-freq-threshold
RHS_TIER_RFT=32     # cluster.read-freq-threshold
RHS_TIER_PF=36000   # cluster.tier-promote-frequency, seconds
RHS_TIER_DF=36000   # cluster.tier-demote-frequency, seconds

No noticeable change in time seen.

The results seen here are consistent with the hypothesis that the tiered volume is doing lookup serially on the hot tier and the cold tier in the case of creates, which leads to worse performance compared to non-tiered volumes. [Not saying that this is the root cause, just that the observations are consistent with this hypothesis.] Note that the time taken for the tiered volume is basically the sum of the times taken by the constituent tiers in these tests.
(In reply to Manoj Pillai from comment #12)
> My test and observations...

Forgot to include the version:
glusterfs-server-3.7.5-7.el7rhgs.x86_64
kernel-3.10.0-229.el7.x86_64 (RHEL-7.1)

Rajesh, can you please include your OS and gluster version as well?
Gluster version: glusterfs-server-3.7.5-6
OS version: Red Hat Enterprise Linux Server release 7.1 (3.10.0-229.7.2.el7.x86_64)
Can you rerun the test with the client and server on the same subnet? Running gluster volume profile, I see the latency for fops is about 300% higher between the QE client (10.70.34.56/23, network 10.70.34.0) and server (10.70.36.42/23, network 10.70.36.0) than on a similar BAGL configuration I have in the perf lab. I'm not sure this explains everything, but we should bring the configurations to a level-set state.
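For the latency comparison, a profile run like the following on both setups gives per-brick fop latencies to compare. This is a sketch using the volume name from this report and the standard gluster volume profile commands:

gluster volume profile disrep_tier start
# run create.sh on the client, then dump and stop profiling
gluster volume profile disrep_tier info
gluster volume profile disrep_tier stop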
Mounted the volume on the same machine (rhs-client18), so there are no longer different subnets; creation of 5k files took 12m.

[root@rhs-client18 bug_retest]# mount | grep test_tier
rhs-client18.lab.eng.blr.redhat.com:test_tier on /mnt/test_tier type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@rhs-client18 bug_retest]# time ./create.sh

real	12m51.803s
user	0m0.581s
sys	0m1.111s
The volume tier_test has ~20K files. I created a second volume with no files. The creates and unlinks on the unpopulated volume were much faster; see the numbers below.

Joseph plans to batch database inserts into a single transaction. Currently we have a separate transaction for each database insert. The theory that this is the right optimization is supported by other sqlite users [1].

[1] http://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite

I suggest reassigning this to Joseph.

---
The test was to turn off the database (turning off ctr).
- there is no promotion/demotion happening.
- there are 27K files in the database.
- on a different volume with an empty database I do *not* see the slowdown.

As far as these tests go, it appears we see slower overall CREATE and UNLINK performance as the number of entries in the database grows.

[root@rhs-client18 ~]# g v set test_tier features.ctr-enabled off
volume set: success
[root@rhs-client18 ~]# echo 3 > /proc/sys/vm/drop_caches; time for i in {1..500}; do echo $i > /mnt/test_tier/newdir/f$i; done

real	0m22.386s
user	0m0.032s
sys	0m0.126s

[root@rhs-client18 ~]# echo 3 > /proc/sys/vm/drop_caches; time rm -f /mnt/test_tier/newdir/f*

real	0m30.410s
user	0m0.014s
sys	0m0.069s

[root@rhs-client18 ~]# g v set test_tier features.ctr-enabled on
volume set: success
[root@rhs-client18 ~]# echo 3 > /proc/sys/vm/drop_caches; time for i in {1..500}; do echo $i > /mnt/test_tier/newdir/f$i; done

real	2m1.057s
user	0m0.029s
sys	0m0.166s

[root@rhs-client18 ~]# echo 3 > /proc/sys/vm/drop_caches; time rm -f /mnt/test_tier/newdir/f*

real	1m31.485s
user	0m0.011s
sys	0m0.076s

echo "select * from gf_flink_tb;" | sqlite3 /rhs/brick5/test_tier_hot4/.glusterfs/test_tier_hot4.db | wc -l
27551

----
Second volume with an empty database:

[root@rhs-client18 ~]# echo 3 > /proc/sys/vm/drop_caches; time for i in {1..500}; do echo $i > /mnt/test_tier_b/newdir/f$i; done

real	0m8.279s
user	0m0.024s
sys	0m0.064s

[root@rhs-client18 ~]# echo 3 > /proc/sys/vm/drop_caches; time rm -f /mnt/test_tier_b/newdir/f*

real	0m12.008s
user	0m0.012s
sys	0m0.053s

[root@rhs-client18 ~]# echo "select * from gf_flink_tb;" | sqlite3 /rhs/brick5/test_tier_hot4.b/.glusterfs/test_tier_hot4.b.db | wc -l
0
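For context on the planned change: batching many INSERTs into one SQLite transaction avoids a journal commit per row. A rough illustration with the sqlite3 CLI against a throwaway database; the table and values here are made up for illustration and are not the CTR schema:

# throwaway database and table
sqlite3 /tmp/ctr_demo.db "create table if not exists t (x integer);"

# autocommit mode: each INSERT is its own transaction (one commit per row)
time seq 1000 | awk '{print "insert into t values ("$1");"}' | sqlite3 /tmp/ctr_demo.db

# batched: the same INSERTs inside a single explicit transaction (one commit total)
time ( echo "begin;"; seq 1000 | awk '{print "insert into t values ("$1");"}'; echo "commit;" ) | sqlite3 /tmp/ctr_demo.db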
On a tiered volume -- 2x(8+4) base volume, 2x2 cache tier -- I run the following commands from a client after attach-tier:

1. mkdir /mnt/glustervol/test_dir  (/mnt/glustervol is the volume mount point)
2. for ii in `seq 10`; do echo $ii > /mnt/glustervol/test_dir/f${ii}; done

On a hot tier server, when I do ls on a brick:

[root@gprfc082 ~]# ls -l /mnt/rhs_brick1/test_dir/
total 20
-rw-r--r-- 2 root root 2 Dec 14 00:32 f1
-rw-r--r-- 2 root root 3 Dec 14 00:32 f10
-rw-r--r-- 2 root root 2 Dec 14 00:32 f2
-rw-r--r-- 2 root root 2 Dec 14 00:32 f3
-rw-r--r-- 2 root root 2 Dec 14 00:32 f5

On a cold tier server:

[root@gprfs045 ~]# ls -l /mnt/rhs_brick4/test_dir/
total 0
---------T 2 root root 0 Dec 14 00:32 f1
---------T 2 root root 0 Dec 14 00:32 f10
---------T 2 root root 0 Dec 14 00:32 f2
---------T 2 root root 0 Dec 14 00:32 f3
---------T 2 root root 0 Dec 14 00:32 f5

Can you explain how the create on a tiered volume is handled, and how it leads to this output? Or point me to a design doc that explains it?
The T files are called "link files". They are artifacts of how DHT works. In DHT, every file is assigned to a particular sub volume based on a hash of its name within the layout of the directory it resides in. This is the "hashed sub volume". In tiering, the hashed sub volume is always the cold sub volume (as we need a single, stable point to locate metadata). In DHT, when a file is not on its hashed sub volume, a link file resides in its place. The link file points to the actual sub volume where the file is. In your case the file was put on the non-hashed (hot) sub volume, so the cold sub volume has the link file.

Let's not use this bug any longer for these discussions, as I would like to contain it to Joseph's "create coalescing" patch. But please feel free to email the tiering group.
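As an aside, where a given link file points can be read directly from its DHT link-to xattr on the cold-tier brick. A sketch against one of the ---------T entries shown in the previous comment; the xattr name is the standard DHT one, but the exact value depends on the volume graph:

# run on the cold tier server that holds the T file
getfattr -n trusted.glusterfs.dht.linkto -e text /mnt/rhs_brick4/test_dir/f1
# the value names the (hot) subvolume where the real file lives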
(In reply to Dan Lambright from comment #19)

There seem to be two problems here.

1. As in comment 12, the create phase on the tiered volume takes about as much time as the sum of the create phase on a non-tiered volume on the cold tier bricks and on a non-tiered volume on the hot tier bricks.
2. There is further degradation in create performance for tiered volumes because of ctr inefficiencies (comment 17).

What is the scope of this bz? I'm assuming that the "create coalescing" patch will focus on problem 2. With regard to problem 1, is the current behaviour the best we can hope for? That is what I was trying to get at with comment 18.
The larger problem seems to be item #2, and I think that will be the focus of this bug. Item #1 can be addressed after item #2, either in this bug or a different one, using some of the suggested mechanisms (parallel lookup? removing extraneous lookups?).
https://code.engineering.redhat.com/gerrit/#/c/64372/1

Options provided for good performance:

gluster vol set <volname> features.ctr-sql-db-wal-autocheckpoint 25000
gluster vol set <volname> features.ctr-sql-db-cachesize 12500

See gluster vol set help for details.
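Spelled out against a concrete volume (assuming the tiered volume from this report, disrep_tier; the option values are the ones recommended above):

gluster volume set disrep_tier features.ctr-sql-db-wal-autocheckpoint 25000
gluster volume set disrep_tier features.ctr-sql-db-cachesize 12500

# confirm the values took effect
gluster volume get disrep_tier all | grep ctr-sql-db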
Creation of 10k files took 3 mins:

[root@vertigo test]# time ./create.sh

real	3m51.858s
user	0m0.546s
sys	0m1.166s

And creation of 50k files took 24 mins:

[root@vertigo test]# time ./create.sh

real	24m22.629s
user	0m4.650s
sys	0m11.221s

[root@ninja bug_fix]# gluster vol info bug_perform

Volume Name: bug_perform
Type: Tier
Volume ID: de6ac608-ecf1-4122-a60d-22ea1092ec7c
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick2/bug_perform_hot
Brick2: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick2/bug_perform_hot
Brick3: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick1/bug_perform_hot
Brick4: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick1/bug_perform_hot
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick5: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick4/bug_perform_cold1
Brick6: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick4/bug_perform_cold1
Brick7: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick3/bug_perform_cold1
Brick8: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick3/bug_perform_cold1
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on

Comparatively, these numbers are better than the previous run, but performance on the tiered volume is still not as good as on a normal (distributed-replicate) volume. Creation of 50k files took 21 mins on the distributed-replicate volume:

[root@vertigo bug]# time ./create.sh

real	21m7.993s
user	0m2.921s
sys	0m6.182s

[root@vertigo bug]# gluster vol info dht_bug

Volume Name: dht_bug
Type: Distributed-Replicate
Volume ID: cda73bf1-641c-407c-8c7d-9853ac24e2fb
Status: Started
Number of Bricks: 2 x 4 = 8
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick1/dht_bug
Brick2: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick1/dht_bug
Brick3: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick2/dht_bug
Brick4: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick2/dht_bug
Brick5: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick3/dht_bug
Brick6: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick3/dht_bug
Brick7: rhs-client19.lab.eng.blr.redhat.com:/rhs/brick4/dht_bug
Brick8: rhs-client18.lab.eng.blr.redhat.com:/rhs/brick4/dht_bug
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.readdir-ahead: on
cluster.lookup-optimize: on
Tested with glusterfs-server-3.7.5-13. After setting

features.ctr-sql-db-cachesize: 12500
features.ctr-sql-db-wal-autocheckpoint: 25000

creation of 50k files took 24m, whereas on a normal distributed-replicate volume it took 21m.

As I need to repeat the same tests with a build that has the above settings as default values, please move this bug to ON_QA once a build with the proper default values is available.
sure
Tested with the 3.7.5-14 build. After creating a new tiered volume, verified the features.ctr-sql-db-wal-autocheckpoint and features.ctr-sql-db-cachesize default values; those values are not modified, so marking this bug as failed QA.

[root@tettnang afr1x2_tier_bug]# rpm -qa | grep glusterfs
glusterfs-3.7.5-14.el7rhgs.x86_64
glusterfs-geo-replication-3.7.5-14.el7rhgs.x86_64
glusterfs-fuse-3.7.5-14.el7rhgs.x86_64
glusterfs-debuginfo-3.7.5-14.el7rhgs.x86_64
glusterfs-api-3.7.5-14.el7rhgs.x86_64
glusterfs-rdma-3.7.5-14.el7rhgs.x86_64
glusterfs-client-xlators-3.7.5-14.el7rhgs.x86_64
glusterfs-server-3.7.5-14.el7rhgs.x86_64
glusterfs-ganesha-3.7.5-14.el7rhgs.x86_64
glusterfs-cli-3.7.5-14.el7rhgs.x86_64
glusterfs-libs-3.7.5-14.el7rhgs.x86_64
glusterfs-api-devel-3.7.5-14.el7rhgs.x86_64
glusterfs-devel-3.7.5-14.el7rhgs.x86_64

[root@tettnang afr1x2_tier_bug]# gluster vol get perform_create all | grep cachesize
features.ctr-sql-db-cachesize           1000
[root@tettnang afr1x2_tier_bug]# gluster vol get perform_create all | grep ctr-sql-db-wal-autocheckpoint
features.ctr-sql-db-wal-autocheckpoint  1000
[root@tettnang afr1x2_tier_bug]#
Creation of 50k files took 6m, and by default features.ctr-sql-db-cachesize and features.ctr-sql-db-wal-autocheckpoint are set to 12500 and 25000 respectively, so marking this bug as verified.

[root@transformers perform_create_new]# time ./create.sh

real	6m22.149s
user	0m2.289s
sys	0m8.038s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html