Bug 1293967 - severe degradation in I/O performance when tests involve migration
Summary: severe degradation in I/O performance when tests involve migration
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: tier
Version: unspecified
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Dan Lambright
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard: tier-performance
Depends On:
Blocks:
 
Reported: 2015-12-23 18:40 UTC by Manoj Pillai
Modified: 2016-09-17 15:39 UTC
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-07 10:47:40 UTC
Embargoed:



Description Manoj Pillai 2015-12-23 18:40:45 UTC
Description of problem:

Performance of I/O to a tiered volume is seen to be extremely poor in some cases. When the migration-related tunable parameters (particularly cluster.tier-*-frequency) are set to prevent promotion/demotion from kicking in for the duration of the test, performance improves significantly.
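
For reference, these tunables are set through the gluster CLI. A minimal sketch, assuming a hypothetical volume named "tiervol" (the option names are the ones used in the comparisons below):

# Sketch only; "tiervol" is a hypothetical volume name. Raising the
# frequencies/thresholds keeps the tier daemon from promoting/demoting
# files during the test window.
gluster volume set tiervol cluster.tier-promote-frequency 36000  # seconds
gluster volume set tiervol cluster.tier-demote-frequency 36000   # seconds
gluster volume set tiervol cluster.write-freq-threshold 8192
gluster volume set tiervol cluster.read-freq-threshold 8192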

Version-Release number of selected component (if applicable):

glusterfs-server-3.7.5-7.el7rhgs.x86_64
Red Hat Enterprise Linux Server release 7.1 (Maipo)


How reproducible:
Consistently.

Steps to Reproduce:
I'm listing the steps from the test that showed the problem. We should be able to reproduce it with a much simpler test, but for now I'm listing the current test details. (A rough sketch of these steps as CLI commands follows the list.)

1. Create a 2x(8+4) base volume (about 15TB capacity); attach 2x2 SAS-SSD as hot tier (about 360GB capacity). Fuse mount on a set of clients.

2. Create a directory smf_init in the mount point. Create a data set of size 480GB within smf_init, of large files each 256MB in size. This fills up the hot tier to the maximum allowed.

3. Create a directory smf_data in the mount point. Create a data set of size 32GB within smf_data, of small files each 64KB. rm -rf <mnt-pt>/smf_init; this deletes all files created in step 2 and frees space within the hot tier.

4. Read files in the directory <mnt-pt>/smf_data and record the read throughput.
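
A rough shell sketch of the above, with hypothetical host and brick names (the attach-tier syntax is the glusterfs 3.7-era form; these are not the exact commands used in the test):

# Hypothetical hosts/bricks; sketch only.
gluster volume create tiervol disperse 12 redundancy 4 \
    host{1..12}:/bricks/cold1 host{1..12}:/bricks/cold2
gluster volume start tiervol
gluster volume attach-tier tiervol replica 2 \
    ssd1:/bricks/hot1 ssd2:/bricks/hot1 ssd1:/bricks/hot2 ssd2:/bricks/hot2
mount -t glusterfs host1:/tiervol /mnt/tiervol

mkdir /mnt/tiervol/smf_init   # step 2: fill hot tier (480GB of 256MB files)
mkdir /mnt/tiervol/smf_data   # step 3: 32GB of 64KB files
rm -rf /mnt/tiervol/smf_init  # step 3: frees space on the hot tier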

Actual results:
The create phase for small files in smf_data reported throughput of 2.2 MB/s. The read phase for small files in smf_data reported throughput of 5 MB/s.

In comparison, the create phase on the base volume (disperse volume) reported 102 MB/s and the read phase on the base volume reported 196 MB/s.

Expected results:

For this test, we should get close to base volume performance.

Additional info:

Comment 2 Manoj Pillai 2015-12-23 18:56:26 UTC
Some comparative performance results for different configurations that point to where the problems might lie.

Performance when tests are run on the base volume (2x(8+4)):
small-file create phase: 102.727173 MB/sec
small-file read phase: 196.626994 MB/sec

Performance on base volume + ctr enabled:
create phase: 23.496851 MB/sec
read phase: 147.093099 MB/sec

[clearly, ctr is a big part of the overall problem, but there are other bzs to track ctr problems]

Performance on tiered volume, with following settings:
RHS_TIER_CTR="on" # features.ctr-enabled
RHS_TIER_MODE="cache" # cluster.tier-mode
RHS_TIER_RC="on"  # features.record-counters
RHS_TIER_WFT=1024   # cluster.write-freq-threshold
RHS_TIER_RFT=1024   # cluster.read-freq-threshold
RHS_TIER_PF=600  # cluster.tier-promote-frequency, seconds
RHS_TIER_DF=600  # cluster.tier-demote-frequency, seconds

create phase: 2.284576 MB/sec
read phase: 4.997511 MB/sec

Performance on tiered volume, with following settings:
RHS_TIER_CTR="on" # features.ctr-enabled
RHS_TIER_MODE="cache" # cluster.tier-mode
RHS_TIER_RC="on"  # features.record-counters
RHS_TIER_WFT=8192   # cluster.write-freq-threshold
RHS_TIER_RFT=8192   # cluster.read-freq-threshold
RHS_TIER_PF=36000  # cluster.tier-promote-frequency, seconds
RHS_TIER_DF=36000  # cluster.tier-demote-frequency, seconds

create phase: 23.675706 MB/sec
read phase: 149.278938 MB/sec

[So bumping up the tier-*-frequency parameters so that migration does not kick in lifts performance to the level of the base volume with ctr enabled.]

Comment 7 Raghavendra G 2016-02-03 06:19:36 UTC
Though not directly causing the issue raised in this bug, I observed the following, which is relevant during file promotion/demotion in tier:

1. rebalance changes the ctime/atime/mtime of a file (even with no access from the application) as it does a setxattr to convert the src to a linkto file (this is from code reading; I didn't test it). This will cause md-cache to invalidate the inode. However, since fuse-bridge doesn't send invalidation notifications to the kernel by default (this behaviour is controlled by the option "fopen-keep-cache", which is "off" by default), this doesn't cause much of an issue. If fopen-keep-cache is turned on, a single promotion/demotion has the effect of purging _all_ (data/metadata) cache from the kernel and the glusterfs client.

Even with fopen-keep-cache turned off, md-cache and io-cache would purge their caches on (m)(c)(a)time modifications.
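
For anyone trying to reproduce the worst case, the fuse mount can be brought up with the option explicitly turned on. A sketch with hypothetical names, assuming a mount.glusterfs that accepts the option:

# Sketch; host/volume names are hypothetical. With fopen-keep-cache on,
# an inode invalidation from a single promotion/demotion also drops the
# kernel-side page/attribute caches for that inode.
mount -t glusterfs -o fopen-keep-cache host1:/tiervol /mnt/tiervol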

Comment 8 Raghavendra G 2016-02-03 06:35:05 UTC
Did some tests to corroborate my hypothesis in comment 7:

[root@unused raghu]# gluster volume info dist
 
Volume Name: dist
Type: Distribute
Volume ID: 31f2c96d-6153-43dc-aeec-8a37d879eab4
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: booradley:/home/export/dist1
Brick2: booradley:/home/export/dist2
Brick3: booradley:/home/export/dist3
Brick4: booradley:/home/export/dist4
Brick5: booradley:/home/export/dist5
Options Reconfigured:
performance.readdir-ahead: on

[root@unused raghu]# ls /home/export/dist[1-5]/ -l
/home/export/dist1/:
total 1028
-rw-r--r--. 2 root root 1048576 Feb  3 11:55 1

/home/export/dist2/:
total 4
---------T. 2 root root 0 Feb  3 11:54 1

/home/export/dist3/:
total 0

/home/export/dist4/:
total 0

/home/export/dist5/:
total 0

[root@unused raghu]# stat /home/export/dist[1][2]/1
stat: cannot stat `/home/export/dist[1][2]/1': No such file or directory
[root@unused raghu]# stat /home/export/dist[1-2]/1
  File: `/home/export/dist1/1'
  Size: 1048576   	Blocks: 2056       IO Block: 4096   regular file
Device: fd02h/64770d	Inode: 9834399     Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:home_root_t:s0
Access: 2016-02-03 11:54:57.000000000 +0530
Modify: 2016-02-03 11:55:48.232104215 +0530
Change: 2016-02-03 11:55:48.232104215 +0530
 Birth: -
  File: `/home/export/dist2/1'
  Size: 0         	Blocks: 8          IO Block: 4096   regular empty file
Device: fd02h/64770d	Inode: 9834398     Links: 2
Access: (1000/---------T)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:home_root_t:s0
Access: 2016-02-03 11:54:57.435058309 +0530
Modify: 2016-02-03 11:54:57.435058309 +0530
Change: 2016-02-03 11:54:57.481058349 +0530
 Birth: -

[root@unused raghu]# date
Wed Feb  3 11:57:30 IST 2016

[root@unused raghu]# gluster volume rebalance dist start force
volume rebalance: dist: success: Rebalance on dist has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 4634a88d-f382-42a1-859c-41063cd11da2

[root@unused raghu]# gluster volume rebalance dist status 
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                1         1.0MB             2             0             0            completed               0.00
volume rebalance: dist: success

[root@unused raghu]# stat /home/export/dist[1-2]/1
  File: `/home/export/dist2/1'
  Size: 1048576   	Blocks: 2056       IO Block: 4096   regular file
Device: fd02h/64770d	Inode: 9834398     Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:home_root_t:s0
Access: 2016-02-03 11:58:31.729252000 +0530
Modify: 2016-02-03 11:55:48.232104000 +0530
Change: 2016-02-03 11:58:31.847252648 +0530
 Birth: -

As can be seen above, rebalance has changed the (m)(c)time of the file on the brick.

And the following is the definition of mdc_inode_iatt_set_validate:

int
mdc_inode_iatt_set_validate(xlator_t *this, inode_t *inode, struct iatt *prebuf,
                            struct iatt *iatt)
{
        int              ret = -1;
        struct md_cache *mdc = NULL;

        mdc = mdc_inode_prep (this, inode);
        if (!mdc)
                goto out;

        LOCK (&mdc->lock);
        {
                if (!iatt || !iatt->ia_ctime) {
                        mdc->ia_time = 0;
                        goto unlock;
                }

                /*
                 * Invalidate the inode if the mtime or ctime has changed
                 * and the prebuf doesn't match the value we have cached.
                 * TODO: writev returns with a NULL iatt due to
                 * performance/write-behind, causing invalidation on writes.
                 */
                if (IA_ISREG(inode->ia_type) &&
                    ((iatt->ia_mtime != mdc->md_mtime) ||
                    (iatt->ia_mtime_nsec != mdc->md_mtime_nsec) ||
                    (iatt->ia_ctime != mdc->md_ctime) ||
                    (iatt->ia_ctime_nsec != mdc->md_ctime_nsec)))
                        if (!prebuf || (prebuf->ia_ctime != mdc->md_ctime) ||
                            (prebuf->ia_ctime_nsec != mdc->md_ctime_nsec) ||
                            (prebuf->ia_mtime != mdc->md_mtime) ||
                            (prebuf->ia_mtime_nsec != mdc->md_mtime_nsec))
                                inode_invalidate(inode);

                mdc_from_iatt (mdc, iatt);

                time (&mdc->ia_time);
        }
unlock:
        UNLOCK (&mdc->lock);
        ret = 0;
out:
        return ret;
}

To be accurate, I did this test only with rebalance, but I am assuming tier might also suffer from a similar issue, as both share a significant chunk of code for file migration.
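
A quick way to check the same on a tiered volume would be a sketch like the following (hypothetical brick paths; compare brick-side timestamps before and after a promotion window):

# Sketch; paths are hypothetical. If migration alone changes mtime/ctime,
# the second stat shows new times even though the application only read.
stat /bricks/cold1/file1              # note mtime/ctime on the cold brick
cat /mnt/tiervol/file1 > /dev/null    # reads to trigger promotion
sleep 600                             # >= cluster.tier-promote-frequency
stat /bricks/hot1/file1               # compare mtime/ctime after migration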

Comment 9 Raghavendra G 2016-02-22 04:13:55 UTC
(In reply to Raghavendra G from comment #7)
> Though not directly causing the issue raised in this bug, I observed
> following things which is relevant during file promotion/demotion in tier:
> 
> 1. rebalance changes the ctime/atime/mtime of a file (even with no access
> from the application) as it does a setxattr to convert the src to a linkto
> file (this is from code reading; I didn't test it). This will cause
> md-cache to invalidate the inode. However, since fuse-bridge doesn't send
> invalidation notifications to the kernel by default (this behaviour is
> controlled by the option "fopen-keep-cache", which is "off" by default),
> this doesn't cause much of an issue.

In modern kernels (RHEL-7) this option will be turned on by default. We turn this option on or off based on whether the kernel gluster is running on supports invalidation. So, this option is turned on from rhgs-3.1 onwards, I suppose.

> If fopen-keep-cache is turned on, a single
> promotion/demotion has the effect of purging _all_ (data/metadata) cache
> from kernel and glusterfs client.

Given that this option can be turned on for a significant number of use-cases, the performance hit might be huge for fuse clients (purging _all_ cache - pages, dentry cache, etc. - in the kernel and glusterfs can have a large impact, I suppose).

> 
> Even with fopen-keep-cache turned off, md-cache and io-cache would purge
> their cache with (m)(c)(a)time modifications.

Comment 10 Manoj Pillai 2016-03-07 10:47:40 UTC
I have not observed this degradation on any of the later 3.1.2 builds, since the ctr improvements went in: https://bugzilla.redhat.com/show_bug.cgi?id=1282729#c32.

