Bug 1571293 - Remove-brick failed on Distributed-Disperse volume while rm -rf is in-progress
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rpc
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Mohit Agrawal
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-24 13:10 UTC by Prasad Desala
Modified: 2019-07-10 06:44 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-10 07:57:20 UTC
Embargoed:



Description Prasad Desala 2018-04-24 13:10:33 UTC
Description of problem:
=======================
Remove-brick failed on a Distributed-Disperse volume while rm -rf was in progress on the mount point.

Version-Release number of selected component (if applicable):
3.12.2-8.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
===================
1) Create a Distributed-Disperse volume and start it.
2) FUSE mount it on multiple clients.
3) From the mount point, create files and directories.
4) Start removing the data set using rm -rf * from multiple clients.
5) Now, initiate remove-brick operation.
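
The steps above can be sketched as a list of command lines. This is only an illustration: the report does not give the volume geometry, so the 4+2 layout (disperse 6, redundancy 2), the host names node1..node6, the brick paths and the mount point are all assumptions. Only the volume name ec_new is taken from the log prefixes. The helper build_repro_commands is hypothetical and only builds strings; it does not run anything, and the data-creation step is left out.

```python
from typing import List


def build_repro_commands(volname: str, hosts: List[str], subvols: int,
                         redundancy: int, mount_point: str) -> List[str]:
    """Build (but do not run) command lines for the reproduction steps.

    Assumption: one brick per host in every disperse set, so the
    disperse count equals the number of hosts.
    """
    disperse = len(hosts)
    # Consecutive groups of `disperse` bricks form one disperse subvolume.
    bricks = [
        f"{host}:/bricks/brick{s}/{volname}"  # hypothetical brick path
        for s in range(subvols)
        for host in hosts
    ]
    # remove-brick on a distributed-disperse volume takes a whole disperse set.
    last_set = " ".join(bricks[-disperse:])
    return [
        f"gluster volume create {volname} disperse {disperse} "
        f"redundancy {redundancy} " + " ".join(bricks),
        f"gluster volume start {volname}",
        # Run on every client:
        f"mount -t glusterfs {hosts[0]}:/{volname} {mount_point}",
        # Run concurrently on several clients once data has been created:
        f"rm -rf {mount_point}/*",
        # Run on a server node while the rm -rf is still going:
        f"gluster volume remove-brick {volname} {last_set} start",
        f"gluster volume remove-brick {volname} {last_set} status",
    ]


if __name__ == "__main__":
    hosts = [f"node{i}" for i in range(1, 7)]  # hypothetical host names
    for cmd in build_repro_commands("ec_new", hosts, 2, 2, "/mnt/ec_new"):
        print(cmd)
```

The status command is the one whose per-node output shows whether the operation completed or failed on each node.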

Remove-brick failed on 5 of the 6 nodes, each with different errors. If these errors have different root causes, let me know and I'll file new BZs to track them separately.

Node-1: 
[2018-04-24 11:41:52.185492] W [MSGID: 109073] [dht-common.c:10519:dht_notify] 0-ec_new-dht: Received CHILD_DOWN. Exiting
The message "W [MSGID: 109073] [dht-common.c:10519:dht_notify] 0-ec_new-dht: Received CHILD_DOWN. Exiting" repeated 2 times between [2018-04-24 11:41:52.185492] and [2018-04-24 11:41:52.185703]
[2018-04-24 11:41:52.200419] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-ec_new-dht: dict is null
[2018-04-24 11:41:52.290707] E [MSGID: 109027] [dht-rebalance.c:4422:gf_defrag_start_crawl] 0-ec_new-dht: Failed to start rebalance: look up on / failed [Transport endpoint is not connected]

Node-2:
[2018-04-24 11:43:47.921584] E [MSGID: 109016] [dht-rebalance.c:3840:gf_defrag_fix_layout] 0-ec_new-dht: Fix layout failed for /linux-4.9.27/Documentation/devicetree/bindings/power/supply


Node-3:
[2018-04-24 11:42:27.198197] W [dht-rebalance.c:3386:gf_defrag_process_dir] 0-ec_new-dht: Found error from gf_defrag_get_entry
[2018-04-24 11:42:27.199605] E [MSGID: 109111] [dht-rebalance.c:3903:gf_defrag_fix_layout] 0-ec_new-dht: gf_defrag_process_dir failed for directory: /linux-4.9.27/Documentation/devicetree/bindings/arm
[2018-04-24 11:42:27.210743] E [MSGID: 109016] [dht-rebalance.c:3840:gf_defrag_fix_layout] 0-ec_new-dht: Fix layout failed for /linux-4.9.27/Documentation/devicetree/bindings/arm

Node-4:
[2018-04-24 11:44:20.565793] E [MSGID: 109110] [dht-rebalance.c:3926:gf_defrag_fix_layout] 0-ec_new-dht: Settle hash failed for /linux-4.9.27/Documentation/devicetree/bindings/powerpc/nintendo
[2018-04-24 11:44:20.571375] E [MSGID: 109016] [dht-rebalance.c:3840:gf_defrag_fix_layout] 0-ec_new-dht: Fix layout failed for /linux-4.9.27/Documentation/devicetree/bindings/powerpc/nintendo

Node-5:
[2018-04-24 11:42:05.224439] W [MSGID: 122040] [ec-common.c:1144:ec_prepare_update_cbk] 0-ec_new-disperse-3: Failed to get size and version [Input/output error]
[2018-04-24 11:42:05.224712] E [MSGID: 109039] [dht-common.c:4078:dht_find_local_subvol_cbk] 0-ec_new-dht: getxattr err for dir [Input/output error]
[2018-04-24 11:42:05.323977] W [MSGID: 122040] [ec-common.c:1144:ec_prepare_update_cbk] 0-ec_new-disperse-0: Failed to get size and version [Input/output error]
[2018-04-24 11:42:05.324146] E [MSGID: 109039] [dht-common.c:4078:dht_find_local_subvol_cbk] 0-ec_new-dht: getxattr err for dir [Input/output error]
[2018-04-24 11:42:05.324991] W [MSGID: 122040] [ec-common.c:1144:ec_prepare_update_cbk] 0-ec_new-disperse-8: Failed to get size and version [Input/output error]
[2018-04-24 11:42:05.325125] E [MSGID: 109039] [dht-common.c:4078:dht_find_local_subvol_cbk] 0-ec_new-dht: getxattr err for dir [Input/output error]
[2018-04-24 11:42:05.333846] E [MSGID: 0] [dht-rebalance.c:4279:dht_get_local_subvols_and_nodeuuids] 0-ec_new-dht: local subvolume determination failed with error: 5 [Input/output error]

Node-6: Remove-brick completed successfully.
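
To compare the failures across nodes, the log excerpts above can be split into fields and grouped by the function that logged them. The following is a minimal sketch. The regular expression is inferred only from the lines quoted in this report (timestamp, level, optional MSGID, file:line:function, domain, message), so it is an assumption and not a full description of the log format. The names parse_line and count_by_function are hypothetical helpers.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

# Pattern inferred from the excerpts in this report; the MSGID field is
# optional because some lines (for example the Node-3 warning) omit it.
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s+(?P<msg>.*)$"
)


def parse_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the fields of one log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def count_by_function(lines: Iterable[str], level: str = "E") -> Counter:
    """Count log lines of the given level per logging function."""
    counts: Counter = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == level:
            counts[record["func"]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < rebalance.log
    for func, n in count_by_function(sys.stdin).most_common():
        print(f"{n:5d}  {func}")
```

Lines that do not follow this shape, such as the "The message ... repeated 2 times" summary line, are skipped and not counted.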

Actual results:
==============
Remove-brick failed on 5 of the 6 nodes while rm -rf was in progress.

Expected results:
=================
Remove-brick should complete without failure.

Comment 27 RHEL Program Management 2018-05-17 14:22:59 UTC
Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 29 Atin Mukherjee 2018-11-10 06:53:43 UTC
Is there any pending work needed to bring this bug to closure?

