Description of problem:
Tiered volume with 2x(8+4) as the cold tier and 2x2 as the hot tier: all promotions went to one of the two hot-tier sub-volumes. This results in an unbalanced distribution of files among the hot-tier bricks.

Version-Release number of selected component (if applicable):
glusterfs*-3.7.5-15.el7rhgs.x86_64
Red Hat Enterprise Linux Server release 7.2

How reproducible:
Consistently

Steps to Reproduce:
1. Create a tiered volume. In this test the hot tier is 2x2 and has a capacity of 360GB. demote-frequency was set to a very high value (36000). promote-frequency was set to 3000, slightly larger than the total time for steps 2 and 3 (data creation) to complete.
2. Create a data set of size greater than the hot-tier capacity (> 360GB in this test) to use up all space on the hot tier. Use large files so data set creation does not take long. Files are created in sub-directory init_files under the mount point.
3. Create a data set of small files (in this test, 64KB file size, 32GB total, i.e. half a million files). These files are created in sub-directory data_files under the mount point. Since the hot-tier capacity was used up in the previous step, these files should get created in the cold tier.
4. Delete sub-directory init_files, freeing up space in the hot tier.
5. Wait for promotion to kick in and observe the result of promotion.

Actual results:
Hot-tier space usage before promotion (immediately after deleting the sub-directory init_files):
/dev/mapper/rhsvg1-rhslv1  181G  149M  181G  1%  /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G  149M  181G  1%  /mnt/rhs_brick2

Hot-tier space usage after promotion (after waiting a while and observing that a stable state was reached):
/dev/mapper/rhsvg1-rhslv1  181G  149M  181G  1%  /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G  3.3G  178G  2%  /mnt/rhs_brick2

The data set consists of 64KB files and cluster.tier-max-files is at its default of 50K, so promotion will move at most 3200MB. All of the promoted data went to one of the two bricks, /mnt/rhs_brick2, while the other remained empty.

Expected results:
Promotion should select files so that they are more or less evenly distributed among the hot-tier sub-volumes.

Additional info:
This was an actual test used to measure promotion speed; it should be possible to reproduce the problem with a simpler test.
Additional info:
Volume Name: perfvol
Type: Tier
Volume ID: 2e861e5f-8b01-4b4c-95cf-f6c2775bfe64
Status: Started
Number of Bricks: 28
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: gprfc083-10ge:/mnt/rhs_brick2
Brick2: gprfc082-10ge:/mnt/rhs_brick2
Brick3: gprfc083-10ge:/mnt/rhs_brick1
Brick4: gprfc082-10ge:/mnt/rhs_brick1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (8 + 4) = 24
Brick5: gprfs045-10ge:/mnt/rhs_brick1
Brick6: gprfs046-10ge:/mnt/rhs_brick1
Brick7: gprfs047-10ge:/mnt/rhs_brick1
Brick8: gprfs048-10ge:/mnt/rhs_brick1
Brick9: gprfs045-10ge:/mnt/rhs_brick2
Brick10: gprfs046-10ge:/mnt/rhs_brick2
Brick11: gprfs047-10ge:/mnt/rhs_brick2
Brick12: gprfs048-10ge:/mnt/rhs_brick2
Brick13: gprfs045-10ge:/mnt/rhs_brick3
Brick14: gprfs046-10ge:/mnt/rhs_brick3
Brick15: gprfs047-10ge:/mnt/rhs_brick3
Brick16: gprfs048-10ge:/mnt/rhs_brick3
Brick17: gprfs045-10ge:/mnt/rhs_brick4
Brick18: gprfs046-10ge:/mnt/rhs_brick4
Brick19: gprfs047-10ge:/mnt/rhs_brick4
Brick20: gprfs048-10ge:/mnt/rhs_brick4
Brick21: gprfs045-10ge:/mnt/rhs_brick5
Brick22: gprfs046-10ge:/mnt/rhs_brick5
Brick23: gprfs047-10ge:/mnt/rhs_brick5
Brick24: gprfs048-10ge:/mnt/rhs_brick5
Brick25: gprfs045-10ge:/mnt/rhs_brick6
Brick26: gprfs046-10ge:/mnt/rhs_brick6
Brick27: gprfs047-10ge:/mnt/rhs_brick6
Brick28: gprfs048-10ge:/mnt/rhs_brick6
Options Reconfigured:
cluster.tier-demote-frequency: 36000
cluster.tier-promote-frequency: 3000
cluster.tier-mode: cache
features.ctr-enabled: on
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.readdir-ahead: on

There is no issue in the way the layouts or hashed subvols are being selected for these files. However, what is probably happening is that the distribution of bricks among these nodes means that only the tier process running on gprfs045-10ge is actually migrating any files.
Both the hot-tier and cold-tier DHTs have exactly 2 subvols and identical directory structures, so the layout distribution for these directories on the 2 tiers is also likely to be the same.

Tier queries the bricks sequentially to create the query file: all entries from local brick1 are retrieved first, followed by the next local brick, and so on. These entries are appended to the query file, so the initial entries will all be from brick1 (and hence subvol0).

The query file is processed sequentially, so all entries from subvol0 are processed for migration first. These files are all likely to go to the same subvol in the target tier because the layout distribution for the parent dir is the same for both tiers. Hence the data distribution in the hot tier is skewed and only one subvol has entries.

The current theory is that before the entries for subvol1 in the query file can be processed, tier hits the max-files or max-mb limit and those files are never migrated. In a worst-case scenario with a large number of files returned by each subvol in every cycle, only the files from a particular subvol are ever promoted, as we do not shuffle the entries returned by the DBs on different bricks.

Moving this to Joseph so he can update with possible solutions.
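The starvation mechanism described above can be sketched as a small simulation. This is a hypothetical model of the behavior, not the actual gluster code: entries from each subvol are appended sequentially to one query file, and the migration loop stops at the max-files limit before ever reaching the second subvol's entries.

```python
# Hypothetical model: a single query file holds all of subvol0's entries
# first, then all of subvol1's; promotion processes it in order and stops
# at the max-files limit.
def promote(query_file, max_files):
    """Process query-file entries in order, stopping at the limit."""
    promoted = []
    for entry in query_file:
        if len(promoted) >= max_files:
            break
        promoted.append(entry)
    return promoted

# 60k entries per subvol in this cycle (counts are illustrative).
query_file = [("subvol0", i) for i in range(60_000)] + \
             [("subvol1", i) for i in range(60_000)]
promoted = promote(query_file, max_files=50_000)
per_subvol = {s: sum(1 for sv, _ in promoted if sv == s)
              for s in ("subvol0", "subvol1")}
print(per_subvol)  # every promoted file comes from subvol0; subvol1 gets 0
```

If each cycle again returns more than max-files entries from subvol0, subvol1's files are never promoted, which is exactly the skew observed on the hot-tier bricks.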
Repeated the test in comment #0, with some changes.

Data set that is a candidate for promotion: 16MB file size, 2k files, total data set size 32GB.
cluster.tier-max-mb: 40000 (changed from default)
cluster.tier-max-files: 50000 (default)
So if migration works correctly, all 32GB of data should get promoted.

Observing the state of the bricks on one of the hot-tier servers at different points in time:

Before promotion starts:
/dev/mapper/rhsvg1-rhslv1  181G  133M  181G  1%  /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G  133M  181G  1%  /mnt/rhs_brick2
[both bricks empty]

Intermediate point 1:
/dev/mapper/rhsvg1-rhslv1  181G  133M  181G  1%  /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G  15G   167G  8%  /mnt/rhs_brick2
[15G promoted, all going to brick2]

Intermediate point 2:
/dev/mapper/rhsvg1-rhslv1  181G  12G   170G  7%  /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G  17G   165G  9%  /mnt/rhs_brick2
[brick2 has received its complete share, roughly half of the 32G; now files are getting promoted to brick1 as well]

Final state:
/dev/mapper/rhsvg1-rhslv1  181G  17G   165G  9%  /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G  17G   165G  9%  /mnt/rhs_brick2
[all 32G promoted, divided equally between the two bricks]

Clearly this style of promoting files serially (first all files to one subvol, then the next) is a big problem and a potential performance killer, as the file distribution on the hot-tier bricks will not be balanced.
Currently a single query file is used for all subvolumes. We can change this to use separate files, one per subvolume. While selecting files to promote, we will iterate through these files in round-robin fashion, so that each subvolume's contribution is balanced.
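The proposed fix can be sketched as follows. This is only an illustration of the round-robin selection idea under the assumptions above (per-subvolume entry lists, a max-files cap), not the actual gluster implementation:

```python
# Sketch: one candidate list per subvolume; interleave them round-robin
# before applying the max-files cap, so each subvolume contributes evenly.
from itertools import chain, zip_longest

def promote_round_robin(per_subvol_files, max_files):
    """Take up to max_files entries, alternating across subvolumes."""
    _SKIP = object()  # filler for exhausted (shorter) lists
    interleaved = chain.from_iterable(
        zip_longest(*per_subvol_files, fillvalue=_SKIP))
    return [e for e in interleaved if e is not _SKIP][:max_files]

subvol0 = [("subvol0", i) for i in range(60_000)]
subvol1 = [("subvol1", i) for i in range(60_000)]
promoted = promote_round_robin([subvol0, subvol1], max_files=50_000)
counts = {s: sum(1 for sv, _ in promoted if sv == s)
          for s in ("subvol0", "subvol1")}
print(counts)  # {'subvol0': 25000, 'subvol1': 25000}
```

With the same 50k cap as in the single-file model, each subvolume now contributes half of the promoted files instead of one subvolume taking the entire quota.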
Can we please have the details of the patch that fixed this issue added to the bug?
This was part of the rebase of 3.1.3 from 3.7.x, so we don't have a separate patch for this in 3.1.3.
Verified this bug in build glusterfs-server-3.7.9-1.el7rhgs.x86_64.

Steps followed:
1) Create a dist-rep volume (4x2)
2) Create 100 files
3) Attach a dist-rep (4x2) hot tier
4) Heat all 100 files created in step 2
5) Check whether the files in the hot tier are distributed across all sub-vols

In the test above, the files were distributed almost equally across all sub-vols. Marking the bug as verified.

However, if the max-files limit is reached in every cycle, we might still end up with an uneven distribution of files. A separate bug has been filed to track the scenario of tier-max-mb being reached in every cycle: https://bugzilla.redhat.com/show_bug.cgi?id=1328721
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240