Description of problem:
Remove-brick start was successful and the operation was in progress for 5.5 days. After that, remove-brick status showed "failed" for one of the nodes.

-----------------------8<------------------------------------
Rebalance log before it went to the failed state on one node:
-----------------------8<------------------------------------
[2020-03-11 21:19:02.824911] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs/kernel/dhcp42-6.lab.eng.blr.redhat.com/dir.13/linux-5.3.2/Documentation/devicetree/bindings/memory-controllers/fsl
[2020-03-11 21:19:02.824948] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs/kernel/dhcp42-6.lab.eng.blr.redhat.com/dir.13/linux-5.3.2/Documentation/devicetree/bindings/memory-controllers
[2020-03-11 21:19:02.826627] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs/kernel/dhcp42-6.lab.eng.blr.redhat.com/dir.13/linux-5.3.2/Documentation/devicetree/bindings
[2020-03-11 21:19:02.827770] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs/kernel/dhcp42-6.lab.eng.blr.redhat.com/dir.13/linux-5.3.2/Documentation/devicetree
[2020-03-11 21:19:02.829078] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs/kernel/dhcp42-6.lab.eng.blr.redhat.com/dir.13/linux-5.3.2/Documentation
[2020-03-11 21:19:02.830316] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs/kernel/dhcp42-6.lab.eng.blr.redhat.com/dir.13/linux-5.3.2
[2020-03-11 21:19:02.831434] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs/kernel/dhcp42-6.lab.eng.blr.redhat.com/dir.13
[2020-03-11 21:19:02.832685] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs/kernel/dhcp42-6.lab.eng.blr.redhat.com
[2020-03-11 21:19:02.833821] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs/kernel
[2020-03-11 21:19:02.834964] E [MSGID: 109016] [dht-rebalance.c:3910:gf_defrag_fix_layout] 0-ec-vol-dht: Fix layout failed for /IOs
[2020-03-11 21:19:02.873745] I [MSGID: 109028] [dht-rebalance.c:5059:gf_defrag_status_get] 0-ec-vol-dht: Rebalance is failed. Time taken is 490617.00 secs
[2020-03-11 21:19:02.873789] I [MSGID: 109028] [dht-rebalance.c:5065:gf_defrag_status_get] 0-ec-vol-dht: Files migrated: 59860, size: 3634086108, lookups: 484138, failures: 43439, skipped: 0
[2020-03-11 21:19:02.905270] W [glusterfsd.c:1581:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x82de) [0x7f05729832de] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x56203790686d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x58) [0x5620379066b8] ) 0-: received signum (15), shutting down
--------------------------8<-----------------------------------

Version-Release number of selected component (if applicable):

How reproducible:
Once

Steps to Reproduce:
1. On a three-node cluster, enabled brick-mux.
2. Created two replicated (1 x 3) volumes and a distributed-disperse volume (4 x (4 + 2)).
3. Mounted ec-vol on 11 clients and ran linux untar, crefi, and lookups from the clients.
4. After the data filled reached 600GB, performed remove-brick start.
5. As the data set is huge, performed rm -rf where data was not being written: removed directories 1-18 on the 11 clients, while data was being written from the 24th directory.

Actual results:
Remove-brick start was performed, then rm -rf was started on the directories.
The remove-brick operation was in progress:

----------------------------------8<----------------------------
gluster volume remove-brick ec-vol dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick8/ec-vol8 dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick8/ec-vol8 dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick8/ec-vol8 dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick9/ec-vol9 dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick9/ec-vol9 dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick9/ec-vol9 status

Node                               Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
---------------------------------  ----------------  -------  -------  --------  -------  -----------  -----------------
dhcp37-157.lab.eng.blr.redhat.com  57441             1.7GB    437061   10367     0        in progress  127:56:31
dhcp37-114.lab.eng.blr.redhat.com  57802             3.4GB    437407   10575     0        in progress  127:56:31
localhost                          57836             891.5MB  437052   10185     0        in progress  127:56:31
Estimated time left for rebalance to complete : 6726:17:29
---------------------------------8<--------------------------------
gluster volume remove-brick ec-vol dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick8/ec-vol8 dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick8/ec-vol8 dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick8/ec-vol8 dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick9/ec-vol9 dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick9/ec-vol9 dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick9/ec-vol9 status

Node                               Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
---------------------------------  ----------------  -------  -------  --------  -------  -----------  -----------------
dhcp37-157.lab.eng.blr.redhat.com  61402             1.7GB    527430   72613     0        in progress  143:00:22
dhcp37-114.lab.eng.blr.redhat.com  59860             3.4GB    484138   43439     0        failed       136:16:57
localhost                          61812             1.8GB    527705   72442     0        in progress  143:00:22
Estimated time left for rebalance to complete : 4725:27:16
----------------------------------8<-------------------------------

Expected results:
The remove-brick operation should succeed.
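For anyone triaging similar output, here is a minimal Python sketch that picks the failed node out of the status rows quoted above and cross-checks the "Time taken is 490617.00 secs" value from the rebalance log against the run time column. The row pattern and the helper names (`parse_status`, `secs_to_hms`) are assumptions made for illustration, based only on the output pasted in this report; they are not part of the gluster CLI.

```python
import re

# Assumed row layout, taken from the status output quoted in this report:
# node, rebalanced files, size, scanned, failures, skipped, status, run time.
ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<files>\d+)\s+(?P<size>\S+)\s+(?P<scanned>\d+)\s+"
    r"(?P<failures>\d+)\s+(?P<skipped>\d+)\s+(?P<status>.+?)\s+"
    r"(?P<runtime>\d+:\d{2}:\d{2})$"
)


def parse_status(text):
    """Return one dict per node row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            row = match.groupdict()
            for key in ("files", "scanned", "failures", "skipped"):
                row[key] = int(row[key])
            rows.append(row)
    return rows


def secs_to_hms(seconds):
    """Convert a duration in seconds to the h:m:s form used by the status table."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Rows copied from the second status output in this report.
SAMPLE = """\
dhcp37-157.lab.eng.blr.redhat.com 61402 1.7GB 527430 72613 0 in progress 143:00:22
dhcp37-114.lab.eng.blr.redhat.com 59860 3.4GB 484138 43439 0 failed 136:16:57
localhost 61812 1.8GB 527705 72442 0 in progress 143:00:22
"""

rows = parse_status(SAMPLE)
failed_nodes = [r["node"] for r in rows if r["status"] == "failed"]
print(failed_nodes)
print(secs_to_hms(490617.00))
```

The converted log duration matches the run time shown for the failed node, and the migrated-files and failures counters in the log (59860 and 43439) match that node's row, so the log excerpt and the status output describe the same rebalance process.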
Additional info:

Volume Name: ec-vol
Type: Distributed-Disperse
Volume ID: adfdd3e2-5699-42e5-a295-96e6c256c160
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick2/ec-vol2
Brick2: dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick2/ec-vol2
Brick3: dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick2/ec-vol2
Brick4: dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick3/ec-vol3
Brick5: dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick3/ec-vol3
Brick6: dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick3/ec-vol3
Brick7: dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick4/ec-vol4
Brick8: dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick4/ec-vol4
Brick9: dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick4/ec-vol4
Brick10: dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick5/ec-vol5
Brick11: dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick5/ec-vol5
Brick12: dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick5/ec-vol5
Brick13: dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick6/ec-vol6
Brick14: dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick6/ec-vol6
Brick15: dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick6/ec-vol6
Brick16: dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick7/ec-vol7
Brick17: dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick7/ec-vol7
Brick18: dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick7/ec-vol7
Brick19: dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick8/ec-vol8
Brick20: dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick8/ec-vol8
Brick21: dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick8/ec-vol8
Brick22: dhcp37-200.lab.eng.blr.redhat.com:/bricks/brick9/ec-vol9
Brick23: dhcp37-157.lab.eng.blr.redhat.com:/bricks/brick9/ec-vol9
Brick24: dhcp37-114.lab.eng.blr.redhat.com:/bricks/brick9/ec-vol9
Options Reconfigured:
disperse.shd-max-threads: 24
client.event-threads: 8
server.event-threads: 8
performance.client-io-threads: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
cluster.brick-multiplex: enable
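As a sanity check on the layout, the sketch below groups the 24 bricks into disperse sets of 6 (4 + 2) and confirms that the six bricks named in the remove-brick command form exactly one set. It assumes that consecutive bricks in the listed order make up one disperse subvolume; the helper names (`subvolumes`, `covers_whole_subvolumes`) are hypothetical and only illustrate the arithmetic.

```python
# Hosts in the order they appear for each brick directory in the volume info.
HOSTS = [
    "dhcp37-200.lab.eng.blr.redhat.com",
    "dhcp37-157.lab.eng.blr.redhat.com",
    "dhcp37-114.lab.eng.blr.redhat.com",
]

# Brick1..Brick24 as listed: brick2..brick9, three hosts each.
BRICKS = [
    f"{host}:/bricks/brick{i}/ec-vol{i}" for i in range(2, 10) for host in HOSTS
]

SET_SIZE = 4 + 2  # data bricks + redundancy bricks per disperse set


def subvolumes(bricks, set_size):
    """Split the brick list into consecutive groups of set_size (assumed layout)."""
    return [bricks[i:i + set_size] for i in range(0, len(bricks), set_size)]


def covers_whole_subvolumes(removed, subvols):
    """True if every subvolume is either fully removed or left untouched."""
    removed_set = set(removed)
    known = {brick for sv in subvols for brick in sv}
    if not removed_set <= known:
        return False
    return all(
        set(sv) <= removed_set or not (set(sv) & removed_set) for sv in subvols
    )


# The six bricks passed to remove-brick in this report (brick8 and brick9).
REMOVED = [f"{host}:/bricks/brick{i}/ec-vol{i}" for i in (8, 9) for host in HOSTS]

subvols = subvolumes(BRICKS, SET_SIZE)
remaining = len(BRICKS) - len(REMOVED)
print(len(subvols), covers_whole_subvolumes(REMOVED, subvols), remaining)
```

Under that assumption the removed bricks are Brick19 through Brick24, i.e. the fourth disperse set, so the operation shrinks the volume from 4 x (4 + 2) = 24 bricks to 3 x (4 + 2) = 18.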
Upstream Patch: https://review.gluster.org/#/c/glusterfs/+/24225/
*** Bug 1839948 has been marked as a duplicate of this bug. ***
*** Bug 1395161 has been marked as a duplicate of this bug. ***
Verified this BZ with:

# rpm -qa | grep gluster
glusterfs-libs-6.0-46.el7rhgs.x86_64
glusterfs-api-6.0-46.el7rhgs.x86_64
glusterfs-geo-replication-6.0-46.el7rhgs.x86_64
glusterfs-6.0-46.el7rhgs.x86_64
glusterfs-fuse-6.0-46.el7rhgs.x86_64
glusterfs-cli-6.0-46.el7rhgs.x86_64
python2-gluster-6.0-46.el7rhgs.x86_64
glusterfs-client-xlators-6.0-46.el7rhgs.x86_64
glusterfs-server-6.0-46.el7rhgs.x86_64

Steps performed for verification of this BZ:
1. On a three-node cluster, enabled brick-mux.
2. Created two replicated (1 x 3) volumes and a distributed-disperse volume (4 x (4 + 2)).
3. Mounted ec-vol on multiple clients and ran linux untar, crefi, and lookups from the clients.
4. After the data was filled, performed remove-brick start.
5. Performed rm -rf where data was not being written.

Moving this BZ to the verified state.
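When re-verifying on another setup, it can help to confirm that every gluster package is at the same version-release before running the steps. A minimal Python sketch, assuming package strings in the name-version-release.arch form shown in the rpm output above; `split_nvra` is a hypothetical helper, not an rpm API.

```python
# Package strings copied from the rpm -qa output in the verification comment.
PACKAGES = """\
glusterfs-libs-6.0-46.el7rhgs.x86_64
glusterfs-api-6.0-46.el7rhgs.x86_64
glusterfs-geo-replication-6.0-46.el7rhgs.x86_64
glusterfs-6.0-46.el7rhgs.x86_64
glusterfs-fuse-6.0-46.el7rhgs.x86_64
glusterfs-cli-6.0-46.el7rhgs.x86_64
python2-gluster-6.0-46.el7rhgs.x86_64
glusterfs-client-xlators-6.0-46.el7rhgs.x86_64
glusterfs-server-6.0-46.el7rhgs.x86_64
""".split()


def split_nvra(package):
    """Split 'name-version-release.arch' into its four parts.

    Assumes the arch is the text after the last '.', and that version and
    release are the last two '-'-separated fields before it.
    """
    nvr, arch = package.rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release, arch


# Collect the distinct version-release strings across all packages.
versions = {
    f"{version}-{release}"
    for _, version, release, _ in map(split_nvra, PACKAGES)
}
print(versions)
```

A single entry in the resulting set means all listed packages are on the same build; more than one entry would point to a partially updated node.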
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5603