Bug 1455241
| Summary: | [Scale] : Rebalance start force is skipping files. | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ambarish <asoman> |
| Component: | distribute | Assignee: | Nithya Balachandran <nbalacha> |
| Status: | CLOSED ERRATA | QA Contact: | Ambarish <asoman> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.3 | CC: | amukherj, asoman, bturner, nbalacha, rhinduja, rhs-bugs, skoduri, spalai, storage-qa-internal |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | RHGS 3.3.0 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | glusterfs-3.8.4-26 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-09-21 04:45:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1447559 | ||
| Bug Blocks: | 1417151 | ||
Looks like a regression, I didn't see this happen on 3.2. Unsure which dev build introduced this, though.
[root@gqac011 gluster-mount]# find . -mindepth 1 -type f -links +1
[root@gqac011 gluster-mount]#
There are no hardlinks.
I did a quick test on *2, I could not repro the error:
[root@gqas013 ~]# gluster v rebalance test status
Node Rebalanced-files size scanned failures skipped status run time in h:m:s
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 34051 6.8MB 121275 0 0 completed 0:07:02
gqas005.sbu.lab.eng.bos.redhat.com 0 0Bytes 0 0 0 completed 0:00:22
gqas014.sbu.lab.eng.bos.redhat.com 0 0Bytes 0 0 0 completed 0:00:22
gqas015.sbu.lab.eng.bos.redhat.com 13782 745.9MB 50203 0 0 completed 0:03:28
volume rebalance: test: success
[root@gqas013 ~]# gluster v info
Volume Name: test
Type: Distributed-Replicate
Volume ID: 61d155ca-05cc-4ad0-8488-aaeb0e829b91
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gqas013:/bricks1/A
Brick2: gqas014:/bricks1/A
Brick3: gqas015:/bricks1/A
Brick4: gqas005:/bricks1/A
Brick5: gqas013:/bricks4/Am
Brick6: gqas015:/bricks4/A
Brick7: gqas013:/bricks8/Am
Brick8: gqas015:/bricks8/A
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
[root@gqas013 ~]#
This is also due to the fallocate BZ (). As fallocate fails, dht_migrate_file returns with -1.
Since the ret code is -1, the task completion function sets op_errno to ENOSPC:
static int
rebalance_task_completion (int op_ret, call_frame_t *sync_frame, void *data)
{
        int32_t op_errno = EINVAL;

        if (op_ret == -1) {
                /* Failure of migration process, mostly due to write process.
                   as we can't preserve the exact errno, lets say there was
                   no space to migrate-data
                */
                op_errno = ENOSPC;
        }
If the op_errno is ENOSPC, dht believes the migration has been skipped.
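The chain described above can be sketched as a small standalone model. This is not the DHT code: migrate_file, completion_errno, classify and the migrate_outcome enum are hypothetical names, and the rule that the caller counts ENOSPC as a skip is taken from the comment above, not from the source tree.

```c
#include <errno.h>

/* Hypothetical outcome buckets, loosely mirroring the "failures" and
 * "skipped" columns of the rebalance status output. */
enum migrate_outcome { MIGRATE_OK, MIGRATE_SKIPPED, MIGRATE_FAILED };

/* Hypothetical migration step: any fallocate failure surfaces as -1,
 * so the original errno is lost at this point. */
static int migrate_file(int fallocate_ret)
{
        return fallocate_ret == 0 ? 0 : -1;
}

/* Stand-in for the completion step quoted above: a -1 return is
 * reported as ENOSPC, anything else keeps the EINVAL default. */
static int completion_errno(int op_ret)
{
        int op_errno = EINVAL;

        if (op_ret == -1)
                op_errno = ENOSPC;
        return op_errno;
}

/* Assumed caller-side classification: ENOSPC is counted as a skip,
 * any other error as a failure. */
static enum migrate_outcome classify(int op_ret)
{
        if (op_ret == 0)
                return MIGRATE_OK;
        return completion_errno(op_ret) == ENOSPC ? MIGRATE_SKIPPED
                                                  : MIGRATE_FAILED;
}
```

In this model, classify(migrate_file(-1)) lands in the skipped bucket even though no brick is out of space, which matches the symptom reported here: a large "skipped" count with 0 failures.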
I am marking this depends on BZ 1447559. This can be retested on the build with the fix for 1447559. There is nothing to be changed in DHT for this.
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/107051/
Works fine on glusterfs-3.8.4-32.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774
Description of problem:
------------------------
9*(4+2) volume, added 6 bricks.
Triggered rebalance start force.
I see a huge number of files being skipped.
Files should _not_ be skipped with the "force" option, especially when I have lots of space on my bricks:
[root@gqas013 glusterfs]# gluster v rebalance khal status
Node Rebalanced-files size scanned failures skipped status run time in h:m:s
--------- ----------- ----------- ----------- ----------- ----------- ------------ --------------
localhost 180 3.8KB 488886 0 173827 in progress 1:27:20
gqas005.sbu.lab.eng.bos.redhat.com 0 0Bytes 0 0 0 in progress 1:27:20
gqas006.sbu.lab.eng.bos.redhat.com 0 0Bytes 0 0 0 in progress 1:27:20
gqas008.sbu.lab.eng.bos.redhat.com 0 0Bytes 0 0 0 in progress 1:27:20
gqas014.sbu.lab.eng.bos.redhat.com 0 0Bytes 0 0 0 in progress 0:00:00
gqas015.sbu.lab.eng.bos.redhat.com 0 0Bytes 0 0 0 in progress 0:00:00
Estimated time left for rebalance to complete : 3:04:19
volume rebalance: khal: success
[root@gqas013 glusterfs]#
Version-Release number of selected component (if applicable):
--------------------------------------------------------------
3.8.4-25
How reproducible:
-----------------
100% on my setup.
Additional info:
----------------
[root@gqas013 glusterfs]# gluster v info
Volume Name: khal
Type: Distributed-Disperse
Volume ID: 415b2241-0f83-4339-a558-257212fe8682
Status: Started
Snapshot Count: 0
Number of Bricks: 10 x (4 + 2) = 60
Transport-type: tcp
Bricks:
Brick1: gqas013:/bricks1/1
Brick2: gqas014:/bricks1/1
Brick3: gqas015:/bricks1/1
Brick4: gqas005:/bricks1/1
Brick5: gqas006:/bricks1/1
Brick6: gqas008:/bricks1/1
Brick7: gqas013:/bricks2/1
Brick8: gqas014:/bricks2/1
Brick9: gqas015:/bricks2/1
Brick10: gqas005:/bricks2/1
Brick11: gqas006:/bricks2/1
Brick12: gqas008:/bricks2/1
Brick13: gqas013:/bricks3/1
Brick14: gqas014:/bricks3/1
Brick15: gqas015:/bricks3/1
Brick16: gqas005:/bricks3/1
Brick17: gqas006:/bricks3/1
Brick18: gqas008:/bricks3/1
Brick19: gqas013:/bricks4/1
Brick20: gqas014:/bricks4/1
Brick21: gqas015:/bricks4/1
Brick22: gqas005:/bricks4/1
Brick23: gqas006:/bricks4/1
Brick24: gqas008:/bricks4/1
Brick25: gqas013:/bricks5/1
Brick26: gqas014:/bricks5/1
Brick27: gqas015:/bricks5/1
Brick28: gqas005:/bricks5/1
Brick29: gqas006:/bricks5/1
Brick30: gqas008:/bricks5/1
Brick31: gqas013:/bricks6/1
Brick32: gqas014:/bricks6/1
Brick33: gqas015:/bricks6/1
Brick34: gqas005:/bricks6/1
Brick35: gqas006:/bricks6/1
Brick36: gqas008:/bricks6/1
Brick37: gqas013:/bricks7/1
Brick38: gqas014:/bricks7/1
Brick39: gqas015:/bricks7/1
Brick40: gqas005:/bricks7/1
Brick41: gqas006:/bricks7/1
Brick42: gqas008:/bricks7/1
Brick43: gqas013:/bricks8/1
Brick44: gqas014:/bricks8/1
Brick45: gqas015:/bricks8/1
Brick46: gqas005:/bricks8/1
Brick47: gqas006:/bricks8/1
Brick48: gqas008:/bricks8/1
Brick49: gqas013:/bricks9/1
Brick50: gqas014:/bricks9/1
Brick51: gqas015:/bricks9/1
Brick52: gqas005:/bricks9/1
Brick53: gqas006:/bricks9/1
Brick54: gqas008:/bricks9/1
Brick55: gqas013:/bricks10/1
Brick56: gqas014:/bricks10/1
Brick57: gqas015:/bricks10/1
Brick58: gqas005:/bricks10/1
Brick59: gqas006:/bricks10/1
Brick60: gqas008:/bricks10/1
Options Reconfigured:
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
client.event-threads: 4
server.event-threads: 4
cluster.lookup-optimize: on
transport.address-family: inet
nfs.disable: off
[root@gqas013 glusterfs]#