Bug 1300679 - promotions not balanced across hot tier sub-volumes
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tier
Hardware: x86_64 Linux
Priority: unspecified Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.1.3
Assigned To: sankarshan
QA Contact: krishnaram Karthick
Keywords: ZStream
Depends On:
Blocks: 1268895 1299184 1302772 1306514
Reported: 2016-01-21 07:40 EST by Manoj Pillai
Modified: 2016-09-17 11:34 EDT

See Also:
Fixed In Version: glusterfs-3.7.9-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1302772
Last Closed: 2016-06-23 01:03:16 EDT
Type: Bug

Description Manoj Pillai 2016-01-21 07:40:02 EST
Description of problem:
On a tiered volume with 2x(8+4) as the cold tier and 2x2 as the hot tier, all promotions went to one of the two hot-tier sub-volumes. This results in an unbalanced distribution of files among the hot-tier bricks.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.2

How reproducible:

Steps to Reproduce:
1. Create a tiered volume. In this test the hot tier is 2x2 and has a capacity of 360GB. demote-frequency was set to a very high value (36000). promote-frequency was set to 3000, which is slightly larger than the total time for steps 2 and 3 (data creation) to complete.

2. Create a data set larger than the hot tier capacity (> 360GB in this test) to use up all the space on the hot tier. Use large files so that data set creation does not take a long time. The files are created in sub-directory init_files under the mount point.

3. Create a data set of small files (in this test, 64KB file size was used, with a data set of 32GB, i.e. half a million files). These files are created in sub-directory data_files under the mount point. Since hot-tier capacity was used up in the previous step, these files should get created in the cold tier.

4. Delete sub-directory init_files, freeing up space in the hot tier.

5. Wait for promotion to kick in and observe the result of promotion.

Actual results:

hot tier space usage before promotion:
(this is immediately after deleting the sub-directory init_files)
/dev/mapper/rhsvg1-rhslv1       181G  149M  181G   1% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2       181G  149M  181G   1% /mnt/rhs_brick2

hot tier space usage after promotion:
(this is after waiting for a while and observing that we have reached stable state)
/dev/mapper/rhsvg1-rhslv1       181G  149M  181G   1% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2       181G  3.3G  178G   2% /mnt/rhs_brick2

The data set consists of 64KB files. cluster.tier-max-files is at its default, i.e. 50K, so promotion will move at most 3200MB in a cycle. All the promoted data has gone to one of the two bricks, /mnt/rhs_brick2, while the other remained empty.
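The 3200MB ceiling follows directly from the two numbers above. A quick check, assuming the 50K default means 50,000 files and treating 1MB as 1000KB (the convention that yields the 3200MB figure in this report):

```python
# Upper bound on data promoted in one cycle when every file is 64KB
# and the per-cycle file cap (cluster.tier-max-files) is 50,000.
max_files = 50_000      # cluster.tier-max-files default, as stated in this report
file_size_kb = 64       # size of each file in the data_files set

promoted_kb = max_files * file_size_kb
promoted_mb = promoted_kb / 1000   # assumption: 1MB = 1000KB

print(f"max promoted per cycle: {promoted_mb:.0f}MB")
```

The 3.3G that df reports on /mnt/rhs_brick2 after promotion, against 149M before, is roughly in line with this cap landing entirely on one brick.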

Expected results:
Promotion should select files so that they are more or less evenly distributed among hot tier sub-volumes.

Additional info:
This was an actual test used to measure promotion speed. It should be possible to reproduce the problem with a simpler test.
Comment 2 Nithya Balachandran 2016-01-22 05:44:03 EST
Additional info:

Volume Name: perfvol
Type: Tier
Volume ID: 2e861e5f-8b01-4b4c-95cf-f6c2775bfe64
Status: Started
Number of Bricks: 28
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: gprfc083-10ge:/mnt/rhs_brick2
Brick2: gprfc082-10ge:/mnt/rhs_brick2
Brick3: gprfc083-10ge:/mnt/rhs_brick1
Brick4: gprfc082-10ge:/mnt/rhs_brick1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (8 + 4) = 24
Brick5: gprfs045-10ge:/mnt/rhs_brick1
Brick6: gprfs046-10ge:/mnt/rhs_brick1
Brick7: gprfs047-10ge:/mnt/rhs_brick1
Brick8: gprfs048-10ge:/mnt/rhs_brick1
Brick9: gprfs045-10ge:/mnt/rhs_brick2
Brick10: gprfs046-10ge:/mnt/rhs_brick2
Brick11: gprfs047-10ge:/mnt/rhs_brick2
Brick12: gprfs048-10ge:/mnt/rhs_brick2
Brick13: gprfs045-10ge:/mnt/rhs_brick3
Brick14: gprfs046-10ge:/mnt/rhs_brick3
Brick15: gprfs047-10ge:/mnt/rhs_brick3
Brick16: gprfs048-10ge:/mnt/rhs_brick3
Brick17: gprfs045-10ge:/mnt/rhs_brick4
Brick18: gprfs046-10ge:/mnt/rhs_brick4
Brick19: gprfs047-10ge:/mnt/rhs_brick4
Brick20: gprfs048-10ge:/mnt/rhs_brick4
Brick21: gprfs045-10ge:/mnt/rhs_brick5
Brick22: gprfs046-10ge:/mnt/rhs_brick5
Brick23: gprfs047-10ge:/mnt/rhs_brick5
Brick24: gprfs048-10ge:/mnt/rhs_brick5
Brick25: gprfs045-10ge:/mnt/rhs_brick6
Brick26: gprfs046-10ge:/mnt/rhs_brick6
Brick27: gprfs047-10ge:/mnt/rhs_brick6
Brick28: gprfs048-10ge:/mnt/rhs_brick6
Options Reconfigured:
cluster.tier-demote-frequency: 36000
cluster.tier-promote-frequency: 3000
cluster.tier-mode: cache
features.ctr-enabled: on
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.readdir-ahead: on

There is no issue in the way the layouts or hashed subvols are being selected for these files. However, what is probably happening is that the distribution of bricks among these nodes means that only the tier process running on gprfs045-10ge is actually migrating any files.

Both the hot and cold tier DHT have exactly 2 subvols and identical directory structures. The layout distribution for these directories on the 2 tiers is also likely to be the same.

Tier queries the bricks sequentially to create the query file, so all entries from local brick1 are retrieved first, followed by the next local brick and so on. These entries are appended to the query file, so the initial entries will all be from brick1 (and hence subvol0).

The query file is processed sequentially. So all entries from subvol0 are processed for migration. These files are all likely to go to the same subvol in the target tier because of the layout distribution for the parent dir being the same for both tiers. Hence the data distribution in the hot tier is skewed and only one subvol has entries.

The current theory is that before the entries for subvol1 from the query file can be processed, tier hits the max-files or max-mb limits and those files are never migrated. 

In a worst case scenario with a large number of files returned by each subvol in every cycle, we could end up in a scenario where only the files from a particular subvol are ever promoted, as we do not shuffle entries returned by the DBs on different bricks.
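The theory above can be illustrated with a toy model. This is only a sketch of the described behaviour, not the tier code: the function names, brick names, and file counts are hypothetical, and the mapping from cold brick to hot subvol encodes the assumption that both tiers share the same layout.

```python
from collections import Counter

def build_query_file(entries_by_brick):
    """Append each brick's entries in brick order, as the comment describes."""
    query_file = []
    for brick, entries in entries_by_brick.items():
        query_file.extend((brick, name) for name in entries)
    return query_file

def promote(query_file, target_subvol_of, max_files):
    """Process the query file front to back and stop at the per-cycle cap."""
    promoted = Counter()
    for brick, _name in query_file[:max_files]:
        promoted[target_subvol_of[brick]] += 1
    return promoted

# Hypothetical numbers: two local cold bricks, one per cold subvol,
# each returning 60,000 candidates; cap of 50,000 files per cycle.
entries_by_brick = {
    "brick1": [f"f{i}" for i in range(60_000)],
    "brick2": [f"g{i}" for i in range(60_000)],
}
# Assumption: identical layouts, so files from cold subvol0 land on
# hot subvol0 and files from cold subvol1 land on hot subvol1.
target_subvol_of = {"brick1": "hot-subvol0", "brick2": "hot-subvol1"}

result = promote(build_query_file(entries_by_brick), target_subvol_of, 50_000)
print(dict(result))  # {'hot-subvol0': 50000}
```

Because the cap is reached before any brick2 entry is seen, hot-subvol1 receives nothing in this cycle, and the same would repeat in every cycle where brick1 alone returns at least max_files candidates.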

Moving this to Joseph so he can update with possible solutions.
Comment 3 Manoj Pillai 2016-01-23 02:31:40 EST
Repeated the test in comment #0, with some changes: 
Data set that is candidate for promotion: 16MB file size, 2k files, total data set size 32GB.
cluster.tier-max-mb 40000 (changed from default)
cluster.tier-max-files 50000 (default)
So if migration works correctly, all 32GB of data should get promoted.

Observing state of bricks on one of the hot tier servers at different points in time:

Before promotion starts:
/dev/mapper/rhsvg1-rhslv1       181G  133M  181G   1% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2       181G  133M  181G   1% /mnt/rhs_brick2
[both bricks empty]

Intermediate point 1:
/dev/mapper/rhsvg1-rhslv1       181G  133M  181G   1% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2       181G   15G  167G   8% /mnt/rhs_brick2
[15g promoted, all going to brick2]

Intermediate point 2:
/dev/mapper/rhsvg1-rhslv1       181G   12G  170G   7% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2       181G   17G  165G   9% /mnt/rhs_brick2
[brick2 has got its complete share, roughly half of the 32g. now we are seeing files getting promoted to brick1 as well]

Final state:
/dev/mapper/rhsvg1-rhslv1       181G   17G  165G   9% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2       181G   17G  165G   9% /mnt/rhs_brick2
[all 32g promoted, divided equally between the two bricks]

Clearly this style of promoting files serially (first all files to one subvol, then the next) is a big problem, a potential performance killer, as file distribution on hot tier bricks will not be balanced.
Comment 4 Dan Lambright 2016-01-23 15:27:07 EST
Currently a single file is used for all sub volumes. We can make a change to have separate files, one for each sub volume.

While selecting files to promote we will iterate through each of the files in round robin fashion, so the contribution of each sub volume shall be balanced.
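A minimal sketch of the round-robin selection proposed here, assuming each sub-volume's query file has already been read into a list. The function name and the sample entries are hypothetical and do not reflect the actual patch.

```python
from itertools import zip_longest

def round_robin(per_subvol_files, max_files):
    """Take one entry from each sub-volume's list in turn, up to max_files."""
    skip = object()  # placeholder for lists that ran out of entries
    picked = []
    for row in zip_longest(*per_subvol_files.values(), fillvalue=skip):
        for entry in row:
            if entry is skip:
                continue
            if len(picked) == max_files:
                return picked
            picked.append(entry)
    return picked

# Hypothetical per-sub-volume query lists of unequal length.
per_subvol_files = {
    "subvol0": [("subvol0", f"f{i}") for i in range(6)],
    "subvol1": [("subvol1", f"g{i}") for i in range(3)],
}
selection = round_robin(per_subvol_files, max_files=4)
print(selection)
# [('subvol0', 'f0'), ('subvol1', 'g0'), ('subvol0', 'f1'), ('subvol1', 'g1')]
```

With a cap of 4, each sub-volume contributes two entries even though subvol0 has twice as many candidates. Once a shorter list is exhausted, the remaining slots go to the sub-volumes that still have entries.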
Comment 26 krishnaram Karthick 2016-04-11 04:54:35 EDT
Can we please have the details of the patch that fixed the issue updated in the bug?
Comment 27 Joseph Elwin Fernandes 2016-04-16 02:36:04 EDT
This was part of the rebase of 3.1.3 from 3.7.x, so we don't have a separate patch for this in 3.1.3.
Comment 29 krishnaram Karthick 2016-04-21 02:04:58 EDT
Verified this bug in build glusterfs-server-3.7.9-1.el7rhgs.x86_64

Steps followed:
1) Create a dist-rep volume (4x2)
2) Create 100 files
3) Attach tier - dist-rep (4x2)
4) Heat all 100 files created in step 2
5) Check if the files in hot tier are distributed across all sub-vols

In the test above, the files were distributed almost equally across all sub-vols. 
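One way to run the check in step 5 is to count files directly on the hot-tier bricks. The helper below is a hypothetical sketch: it assumes it runs on a brick server, that one brick per replica pair is passed in, and that the internal .glusterfs directory should be skipped. It does not distinguish DHT link files from data files.

```python
import os

def count_files(brick_path):
    """Count files under a brick, skipping the .glusterfs metadata tree."""
    total = 0
    for _root, dirs, files in os.walk(brick_path):
        # Prune in place so os.walk does not descend into .glusterfs.
        dirs[:] = [d for d in dirs if d != ".glusterfs"]
        total += len(files)
    return total

def distribution(brick_paths):
    """Return each brick's share of the total file count as a fraction."""
    counts = {p: count_files(p) for p in brick_paths}
    total = sum(counts.values())
    return {p: (c / total if total else 0.0) for p, c in counts.items()}
```

Calling distribution() with one brick path from each hot-tier sub-volume gives the fraction held by each; for a balanced 4x2 hot tier the values should all be close to 0.25.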

Marking the bug as verified.

However, with max files being reached in every cycle, we might still end up with an uneven distribution of files. A separate bug has been filed to track the scenario of tier-max-mb being reached in every cycle - https://bugzilla.redhat.com/show_bug.cgi?id=1328721
Comment 33 errata-xmlrpc 2016-06-23 01:03:16 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

