Bug 1300679
| Summary: | promotions not balanced across hot tier sub-volumes | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Manoj Pillai <mpillai> |
| Component: | tier | Assignee: | sankarshan <sankarshan> |
| Status: | CLOSED ERRATA | QA Contact: | krishnaram Karthick <kramdoss> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.1 | CC: | annair, asrivast, byarlaga, dlambrig, mpillai, nbalacha, nchilaka, rcyriac, rhinduja, rhs-bugs, rkavunga, sankarshan, smohan, storage-qa-internal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.1.3 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | tier-migration | | |
| Fixed In Version: | glusterfs-3.7.9-1 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1302772 (view as bug list) | Environment: | |
| Last Closed: | 2016-06-23 05:03:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1268895, 1299184, 1302772, 1306514 | | |
Description
Manoj Pillai
2016-01-21 12:40:02 UTC
Additional info:

```
Volume Name: perfvol
Type: Tier
Volume ID: 2e861e5f-8b01-4b4c-95cf-f6c2775bfe64
Status: Started
Number of Bricks: 28
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: gprfc083-10ge:/mnt/rhs_brick2
Brick2: gprfc082-10ge:/mnt/rhs_brick2
Brick3: gprfc083-10ge:/mnt/rhs_brick1
Brick4: gprfc082-10ge:/mnt/rhs_brick1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (8 + 4) = 24
Brick5: gprfs045-10ge:/mnt/rhs_brick1
Brick6: gprfs046-10ge:/mnt/rhs_brick1
Brick7: gprfs047-10ge:/mnt/rhs_brick1
Brick8: gprfs048-10ge:/mnt/rhs_brick1
Brick9: gprfs045-10ge:/mnt/rhs_brick2
Brick10: gprfs046-10ge:/mnt/rhs_brick2
Brick11: gprfs047-10ge:/mnt/rhs_brick2
Brick12: gprfs048-10ge:/mnt/rhs_brick2
Brick13: gprfs045-10ge:/mnt/rhs_brick3
Brick14: gprfs046-10ge:/mnt/rhs_brick3
Brick15: gprfs047-10ge:/mnt/rhs_brick3
Brick16: gprfs048-10ge:/mnt/rhs_brick3
Brick17: gprfs045-10ge:/mnt/rhs_brick4
Brick18: gprfs046-10ge:/mnt/rhs_brick4
Brick19: gprfs047-10ge:/mnt/rhs_brick4
Brick20: gprfs048-10ge:/mnt/rhs_brick4
Brick21: gprfs045-10ge:/mnt/rhs_brick5
Brick22: gprfs046-10ge:/mnt/rhs_brick5
Brick23: gprfs047-10ge:/mnt/rhs_brick5
Brick24: gprfs048-10ge:/mnt/rhs_brick5
Brick25: gprfs045-10ge:/mnt/rhs_brick6
Brick26: gprfs046-10ge:/mnt/rhs_brick6
Brick27: gprfs047-10ge:/mnt/rhs_brick6
Brick28: gprfs048-10ge:/mnt/rhs_brick6
Options Reconfigured:
cluster.tier-demote-frequency: 36000
cluster.tier-promote-frequency: 3000
cluster.tier-mode: cache
features.ctr-enabled: on
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.readdir-ahead: on
```

There is no issue in the way the layouts or hashed subvols are being selected for these files. However, what is probably happening is that, given the distribution of bricks among these nodes, only the tier process running on gprfs045-10ge is actually migrating any files.

Both the hot and cold tier DHTs have exactly 2 subvols and identical directory structures, so the layout distribution for these directories on the two tiers is also likely to be the same.

Tier queries the bricks sequentially to create the query file, so all entries from the local brick1 are retrieved first, followed by the next local brick, and so on. These entries are appended to the query file, so the initial entries will all be from brick1 (and hence subvol0). The query file is processed sequentially, so all entries from subvol0 are processed for migration first. Because the layout distribution of the parent directory is the same on both tiers, these files are all likely to land on the same subvol of the target tier. Hence the data distribution in the hot tier is skewed and only one subvol has entries.

The current theory is that before the entries for subvol1 in the query file can be processed, tier hits the tier-max-files or tier-max-mb limits, and those files are never migrated. In the worst case, with a large number of files returned by each subvol in every cycle, only the files from a particular subvol are ever promoted, since the entries returned by the DBs on the different bricks are not shuffled.

Moving this to Joseph so he can update with possible solutions.

Repeated the test in comment #0, with some changes:

Data set that is a candidate for promotion: 16MB file size, 2k files, total data set size 32GB.
cluster.tier-max-mb: 40000 (changed from default)
cluster.tier-max-files: 50000 (default)

So if migration works correctly, all 32GB of data should get promoted.
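A quick back-of-the-envelope check of those limits (plain Python, nothing GlusterFS-specific; the numbers are simply the ones quoted above): 2000 files of 16 MB each is 32,000 MB, which is below the 40,000 MB cap and far below the 50,000-file cap, so the per-cycle limits alone should not prevent the whole candidate set from being promoted.

```python
# Illustrative arithmetic only; variable names are not GlusterFS APIs.
file_size_mb = 16        # each candidate file is 16 MB
num_files = 2000         # "2k files"
tier_max_mb = 40000      # cluster.tier-max-mb (changed from default)
tier_max_files = 50000   # cluster.tier-max-files (default)

total_mb = file_size_mb * num_files
print(f"candidate data: {total_mb} MB")   # 32000 MB, the ~32GB data set

# If promotion honours only these two limits, everything fits in one cycle.
fits_in_one_cycle = total_mb <= tier_max_mb and num_files <= tier_max_files
print("fits within a single promotion cycle:", fits_in_one_cycle)  # True
```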
Observing the state of the bricks on one of the hot tier servers at different points in time:

Before promotion starts [both bricks empty]:
```
/dev/mapper/rhsvg1-rhslv1  181G  133M  181G   1% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G  133M  181G   1% /mnt/rhs_brick2
```

Intermediate point 1 [15G promoted, all going to brick2]:
```
/dev/mapper/rhsvg1-rhslv1  181G  133M  181G   1% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G   15G  167G   8% /mnt/rhs_brick2
```

Intermediate point 2 [brick2 has got its complete share, roughly half of the 32G; now files are getting promoted to brick1 as well]:
```
/dev/mapper/rhsvg1-rhslv1  181G   12G  170G   7% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G   17G  165G   9% /mnt/rhs_brick2
```

Final state [all 32G promoted, divided equally between the two bricks]:
```
/dev/mapper/rhsvg1-rhslv1  181G   17G  165G   9% /mnt/rhs_brick1
/dev/mapper/rhsvg2-rhslv2  181G   17G  165G   9% /mnt/rhs_brick2
```

Clearly this style of promoting files serially (first all files to one subvol, then the next) is a big problem and a potential performance killer, since the file distribution across hot tier bricks will not be balanced.

Currently a single query file is used for all sub-volumes. We can change this to use separate files, one for each sub-volume. While selecting files to promote, we will iterate through each of these files in round-robin fashion, so the contribution of each sub-volume shall be balanced. (An illustrative sketch of this round-robin selection appears at the end of this report.)

Can we please have the patch details updated in the bug which fixed the issue?

This was part of the rebase of 3.1.3 from 3.7.x, so we don't have a separate patch for this in 3.1.3.

Verified this bug in build glusterfs-server-3.7.9-1.el7rhgs.x86_64.

Steps followed:
1) Create a dist-rep volume (4x2)
2) Create 100 files
3) Attach tier - dist-rep (4x2)
4) Heat all 100 files created in step 2
5) Check whether the files in the hot tier are distributed across all sub-vols

In the test above, the files were distributed almost equally across all sub-vols. Marking the bug as verified. However, with max-files being reached in every cycle, we might still end up with an uneven distribution of files. A separate bug has been filed to track the scenario of tier-max-mb being reached in every cycle - https://bugzilla.redhat.com/show_bug.cgi?id=1328721

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240
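For illustration, a minimal Python sketch of the per-subvolume, round-robin selection proposed above, assuming per-subvolume query lists are already available. The function name, data layout, and numbers are hypothetical; the actual tier migrator in glusterfs is implemented in C and does not use these names.

```python
from collections import deque

def pick_promotion_candidates(per_subvol_entries, max_files, max_mb):
    """Round-robin selection across per-subvolume query lists.

    per_subvol_entries: dict mapping subvolume name -> list of (path, size_mb).
    Returns the files selected for one promotion cycle, interleaved so that
    no single subvolume consumes the whole per-cycle budget.
    """
    queues = {sv: deque(entries) for sv, entries in per_subvol_entries.items()}
    selected, total_mb = [], 0

    while queues and len(selected) < max_files and total_mb < max_mb:
        for sv in list(queues):
            if not queues[sv]:
                del queues[sv]      # this subvolume's list is exhausted
                continue
            path, size_mb = queues[sv].popleft()
            if len(selected) >= max_files or total_mb + size_mb > max_mb:
                # Budget exhausted; in this sketch the popped entry simply
                # waits for the next cycle.
                return selected
            selected.append((sv, path))
            total_mb += size_mb
    return selected

# Example: two subvolumes, each with more data than one cycle can move.
entries = {
    "subvol0": [(f"/dir/a{i}", 16) for i in range(1000)],
    "subvol1": [(f"/dir/b{i}", 16) for i in range(1000)],
}
picked = pick_promotion_candidates(entries, max_files=50000, max_mb=8000)
per_subvol = {sv: sum(1 for s, _ in picked if s == sv) for sv in entries}
print(per_subvol)  # equal counts: {'subvol0': 250, 'subvol1': 250}
```

With the per-cycle caps applied while interleaving, each sub-volume contributes roughly the same number of files per cycle, instead of one sub-volume consuming the entire budget as described in the analysis above.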