Description of problem:
Performance of I/O to a tiered volume is extremely poor in some cases. When the migration-related tunable parameters (particularly cluster.tier-*-frequency) are set high enough to prevent promotion/demotion from kicking in for the duration of the test, performance improves significantly.

Version-Release number of selected component (if applicable):
glusterfs-server-3.7.5-7.el7rhgs.x86_64
Red Hat Enterprise Linux Server release 7.1 (Maipo)

How reproducible:
Consistently.

Steps to Reproduce:
I'm listing the steps from the test that showed the problem. A much simpler test should also reproduce it, but for now these are the current test details.
1. Create a 2x(8+4) base volume (about 15TB capacity); attach 2x2 SAS-SSD as the hot tier (about 360GB capacity). FUSE-mount on a set of clients.
2. Create directory smf_init in the mount point. Create a 480GB data set within smf_init, of large files each 256MB in size. This fills the hot tier to the maximum allowed.
3. Create directory smf_data in the mount point. Create a 32GB data set within smf_data, of small files each 64KB in size. Run rm -rf <mnt-pt>/smf_init; this deletes all files created in step 2 and frees space within the hot tier.
4. Read the files in <mnt-pt>/smf_data and record the read throughput.

Actual results:
The create phase for small files in smf_data reported a throughput of 2.2 MB/s. The read phase for small files in smf_data reported a throughput of 5 MB/s. In comparison, the create phase on the base volume (a disperse volume) reported 102 MB/s and the read phase reported 196 MB/s.

Expected results:
For this test, we should get close to base-volume performance.

Additional info:
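The reproduction steps above can be sketched as a shell session. This is a hypothetical sketch, not the actual test harness: the host names (server1, server2), brick paths, and mount point are placeholders, data-set creation is shown with plain dd loops rather than the real benchmark tool, and the attach-tier syntax is the one I believe glusterfs 3.7 accepts.

```shell
# Hypothetical reproduction sketch; hosts, brick paths and mount point are
# placeholders, and the exact CLI syntax should be checked against the
# glusterfs 3.7 documentation.

# 1. Create the 2x(8+4) disperse base volume and attach a 2x2 hot tier.
gluster volume create tiervol disperse-data 8 redundancy 4 \
    server{1..2}:/bricks/hdd{1..12}/brick
gluster volume start tiervol
gluster volume attach-tier tiervol replica 2 \
    server{1..2}:/bricks/ssd{1..2}/brick
mount -t glusterfs server1:/tiervol /mnt/tiervol

# 2. Fill the hot tier: 480GB of 256MB files (1920 files).
mkdir /mnt/tiervol/smf_init
for i in $(seq 1 1920); do
    dd if=/dev/zero of=/mnt/tiervol/smf_init/f$i bs=1M count=256
done

# 3. Create the small-file data set: 32GB of 64KB files, then free the tier.
mkdir /mnt/tiervol/smf_data
for i in $(seq 1 524288); do
    dd if=/dev/zero of=/mnt/tiervol/smf_data/f$i bs=64K count=1
done
rm -rf /mnt/tiervol/smf_init

# 4. Read the small files back and measure throughput.
time cat /mnt/tiervol/smf_data/* > /dev/null
```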
Some comparative performance results for different configurations that point to where the problems might lie.

Performance when tests are run on the base volume (2x(8+4)):
  small-file create phase: 102.727173 MB/sec
  small-file read phase:   196.626994 MB/sec

Performance on base volume + ctr enabled:
  create phase: 23.496851 MB/sec
  read phase:   147.093099 MB/sec
[Clearly, ctr is a big part of the overall problem, but there are other BZs to track the ctr problems.]

Performance on tiered volume, with the following settings:
  RHS_TIER_CTR="on"     # features.ctr-enabled
  RHS_TIER_MODE="cache" # cluster.tier-mode
  RHS_TIER_RC="on"      # features.record-counters
  RHS_TIER_WFT=1024     # cluster.write-freq-threshold
  RHS_TIER_RFT=1024     # cluster.read-freq-threshold
  RHS_TIER_PF=600       # cluster.tier-promote-frequency, seconds
  RHS_TIER_DF=600       # cluster.tier-demote-frequency, seconds
  create phase: 2.284576 MB/sec
  read phase:   4.997511 MB/sec

Performance on tiered volume, with the following settings:
  RHS_TIER_CTR="on"     # features.ctr-enabled
  RHS_TIER_MODE="cache" # cluster.tier-mode
  RHS_TIER_RC="on"      # features.record-counters
  RHS_TIER_WFT=8192     # cluster.write-freq-threshold
  RHS_TIER_RFT=8192     # cluster.read-freq-threshold
  RHS_TIER_PF=36000     # cluster.tier-promote-frequency, seconds
  RHS_TIER_DF=36000     # cluster.tier-demote-frequency, seconds
  create phase: 23.675706 MB/sec
  read phase:   149.278938 MB/sec

[So bumping up the tier-*-frequency parameters so that migration does not kick in lifts performance up to the level of the base volume with ctr enabled.]
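For reference, the RHS_TIER_* variables above map onto gluster volume set options named in the inline comments. A sketch of applying the first (600s) configuration, assuming a volume named tiervol (a placeholder):

```shell
# Hypothetical: apply the slow (600s) tier configuration used in the test
# above; option names are taken from the comments next to each variable.
VOL=tiervol
gluster volume set $VOL features.ctr-enabled on
gluster volume set $VOL cluster.tier-mode cache
gluster volume set $VOL features.record-counters on
gluster volume set $VOL cluster.write-freq-threshold 1024
gluster volume set $VOL cluster.read-freq-threshold 1024
gluster volume set $VOL cluster.tier-promote-frequency 600
gluster volume set $VOL cluster.tier-demote-frequency 600
```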
Though not directly causing the issue raised in this bug, I observed the following, which is relevant during file promotion/demotion in tier:

1. Rebalance changes the ctime/atime/mtime of a file (even with no access from the application) as it does a setxattr to convert the src file into a linkto file (this is from code reading; I didn't test it). This causes md-cache to invalidate the inode. However, since fuse-bridge doesn't by default send invalidation notifications to the kernel (this behaviour is controlled by the option "fopen-keep-cache", which is "off" by default), this doesn't cause much of an issue. If fopen-keep-cache is turned on, a single promotion/demotion has the effect of purging _all_ (data/metadata) cache from the kernel and the glusterfs client. Even with fopen-keep-cache turned off, md-cache and io-cache would purge their caches on (m)(c)(a)time modifications.
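The claim that a metadata-only setxattr bumps a file's ctime (and would therefore trip md-cache's (m)(c)time check) is ordinary POSIX behaviour and is easy to verify locally. A minimal sketch, using setfattr and falling back to chmod where user xattrs are unsupported; this only demonstrates the generic filesystem behaviour, not the rebalance code path itself:

```shell
# Show that a metadata-only operation updates ctime, with no data access
# from the application -- the same effect the rebalance setxattr has.
f=$(mktemp)
c1=$(stat -c %Z "$f")
sleep 1.1   # ctime granularity can be as coarse as 1s on some filesystems
# setfattr needs the attr package and user-xattr support; chmod is an
# equivalent metadata-only fallback (POSIX marks ctime for update on both).
setfattr -n user.demo -v 1 "$f" 2>/dev/null || chmod u+rw "$f"
c2=$(stat -c %Z "$f")
[ "$c2" -gt "$c1" ] && echo "ctime changed by a metadata-only operation"
rm -f "$f"
```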
Did some tests to corroborate my hypothesis in comment 7:

[root@unused raghu]# gluster volume info dist

Volume Name: dist
Type: Distribute
Volume ID: 31f2c96d-6153-43dc-aeec-8a37d879eab4
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: booradley:/home/export/dist1
Brick2: booradley:/home/export/dist2
Brick3: booradley:/home/export/dist3
Brick4: booradley:/home/export/dist4
Brick5: booradley:/home/export/dist5
Options Reconfigured:
performance.readdir-ahead: on

[root@unused raghu]# ls /home/export/dist[1-5]/ -l
/home/export/dist1/:
total 1028
-rw-r--r--. 2 root root 1048576 Feb 3 11:55 1

/home/export/dist2/:
total 4
---------T. 2 root root 0 Feb 3 11:54 1

/home/export/dist3/:
total 0

/home/export/dist4/:
total 0

/home/export/dist5/:
total 0

[root@unused raghu]# stat /home/export/dist[1][2]/1
stat: cannot stat `/home/export/dist[1][2]/1': No such file or directory

[root@unused raghu]# stat /home/export/dist[1-2]/1
  File: `/home/export/dist1/1'
  Size: 1048576   Blocks: 2056   IO Block: 4096   regular file
Device: fd02h/64770d   Inode: 9834399   Links: 2
Access: (0644/-rw-r--r--)   Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:home_root_t:s0
Access: 2016-02-03 11:54:57.000000000 +0530
Modify: 2016-02-03 11:55:48.232104215 +0530
Change: 2016-02-03 11:55:48.232104215 +0530
 Birth: -
  File: `/home/export/dist2/1'
  Size: 0   Blocks: 8   IO Block: 4096   regular empty file
Device: fd02h/64770d   Inode: 9834398   Links: 2
Access: (1000/---------T)   Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:home_root_t:s0
Access: 2016-02-03 11:54:57.435058309 +0530
Modify: 2016-02-03 11:54:57.435058309 +0530
Change: 2016-02-03 11:54:57.481058349 +0530
 Birth: -

[root@unused raghu]# date
Wed Feb 3 11:57:30 IST 2016

[root@unused raghu]# gluster volume rebalance dist start force
volume rebalance: dist: success: Rebalance on dist has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 4634a88d-f382-42a1-859c-41063cd11da2

[root@unused raghu]# gluster volume rebalance dist status
     Node  Rebalanced-files   size  scanned  failures  skipped     status  run time in secs
---------  ----------------  -----  -------  --------  -------  ---------  ----------------
localhost                 1  1.0MB        2         0        0  completed              0.00
volume rebalance: dist: success

[root@unused raghu]# stat /home/export/dist[1-2]/1
  File: `/home/export/dist2/1'
  Size: 1048576   Blocks: 2056   IO Block: 4096   regular file
Device: fd02h/64770d   Inode: 9834398   Links: 2
Access: (0644/-rw-r--r--)   Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:home_root_t:s0
Access: 2016-02-03 11:58:31.729252000 +0530
Modify: 2016-02-03 11:55:48.232104000 +0530
Change: 2016-02-03 11:58:31.847252648 +0530
 Birth: -

As can be seen above, rebalance has changed the (m)(c)time of the file on the brick. The following is the definition of mdc_inode_iatt_set_validate:

int
mdc_inode_iatt_set_validate (xlator_t *this, inode_t *inode,
                             struct iatt *prebuf, struct iatt *iatt)
{
        int              ret = -1;
        struct md_cache *mdc = NULL;

        mdc = mdc_inode_prep (this, inode);
        if (!mdc)
                goto out;

        LOCK (&mdc->lock);
        {
                if (!iatt || !iatt->ia_ctime) {
                        mdc->ia_time = 0;
                        goto unlock;
                }

                /*
                 * Invalidate the inode if the mtime or ctime has changed
                 * and the prebuf doesn't match the value we have cached.
                 * TODO: writev returns with a NULL iatt due to
                 * performance/write-behind, causing invalidation on writes.
                 */
                if (IA_ISREG(inode->ia_type) &&
                    ((iatt->ia_mtime != mdc->md_mtime) ||
                     (iatt->ia_mtime_nsec != mdc->md_mtime_nsec) ||
                     (iatt->ia_ctime != mdc->md_ctime) ||
                     (iatt->ia_ctime_nsec != mdc->md_ctime_nsec)))
                        if (!prebuf || (prebuf->ia_ctime != mdc->md_ctime) ||
                            (prebuf->ia_ctime_nsec != mdc->md_ctime_nsec) ||
                            (prebuf->ia_mtime != mdc->md_mtime) ||
                            (prebuf->ia_mtime_nsec != mdc->md_mtime_nsec))
                                inode_invalidate(inode);

                mdc_from_iatt (mdc, iatt);
                time (&mdc->ia_time);
        }
unlock:
        UNLOCK (&mdc->lock);
        ret = 0;
out:
        return ret;
}

To be accurate, I did this test only with rebalance, but I assume tier might suffer from a similar issue, as both share a significant chunk of the file-migration code.
(In reply to Raghavendra G from comment #7)
> Though not directly causing the issue raised in this bug, I observed
> following things which is relevant during file promotion/demotion in tier:
>
> 1. rebalance changes ctime/atime/mtime of a file (even with no access from
> application) as it does setxattr to convert src to linkto file (this is
> through code reading, but didn't test it). This will result in md-cache to
> do invalidation of inode. However, since fuse-bridge doesn't by default send
> invalidation notification to kernel (this behaviour is controlled
> by option "fopen-keep-cache", which is "off" by default), this doesn't cause
> much of an issue.

On modern kernels (RHEL-7) this option is turned on by default. We turn this option on/off based on whether the kernel gluster is running on supports invalidation. So this option is turned on from rhgs-3.1 onwards, I suppose.

> If fopen-keep-cache is turned on, a single
> promotion/demotion has the effect of purging _all_ (data/metadata) cache
> from kernel and glusterfs client.

Given that this option can be turned on for a significant number of use-cases, the performance hit might be huge for fuse clients (purging _all_ cache - pages, dentry cache, etc. - in both the kernel and glusterfs can have a large impact, I suppose).

> Even with fopen-keep-cache turned off, md-cache and io-cache would purge
> their cache with (m)(c)(a)time modifications.
I have not observed this degradation on any of the later 3.1.2 builds, since the ctr improvements went in: https://bugzilla.redhat.com/show_bug.cgi?id=1282729#c32.