Description of problem:
=======================
On an 11x3 volume, rebalance on a node failed when a replica pair brick was brought down while remove-brick was in progress.
Hit Enter by mistake while filing the bug. Please see the complete details of the bug below.

Description of problem:
=======================
A few files are not migrated off the decommissioned bricks when bricks are brought down while remove-brick is in progress.

Version-Release number of selected component (if applicable):
3.12.2-7.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
===================
1) Create an x3 volume and start it.
2) FUSE mount it on multiple clients, then start a Linux kernel untar and lookups from the clients.
3) Start removing a few bricks.
4) While remove-brick is in progress, kill a brick from a replica pair. Because brick mux is enabled, killing a single brick on the server with kill -9 takes down all the bricks on that node.
5) Wait until the rebalance completes on the nodes.

Actual results:
===============
A few files are not migrated off the decommissioned bricks; committing the remove-brick results in data loss.

Expected results:
================
The remove-brick operation should migrate all the files from the decommissioned bricks.
From rebalance log:

[2018-04-06 11:52:54.484298] I [MSGID: 0] [dht-rebalance.c:3732:gf_defrag_fix_layout] 0-nithya: entry->name = vexpress-scc.txt
[2018-04-06 11:52:54.484568] W [MSGID: 114061] [client-common.c:1197:client_pre_readdirp] 0-pingx3-client-30: (6072b1ff-c676-4c76-993c-cfad73e0a4f5) remote_fd is -1. EBADFD [File descriptor in bad state]
[2018-04-06 11:52:54.485145] E [MSGID: 109058] [dht-rebalance.c:3715:gf_defrag_fix_layout] 0-pingx3-dht: readdirp failed for path /linux-4.4.36/Documentation/devicetree/bindings/arm. Aborting fix-layout [File descriptor in bad state]

This is a known issue. The brick that was brought down was serving the readdirp request. An in-flight readdirp is generally not transferred to another AFR child, because directory offsets do not match between bricks.

Moving this to the AFR component for clarification. Please move it back to DHT if you feel otherwise.

- Susant
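To illustrate the offset mismatch mentioned above, here is a toy Python model, not GlusterFS code. The `Brick` class, the cookie values, and `list_with_failover` are all hypothetical; real directory offsets are opaque values chosen by each brick's backend filesystem. The sketch only shows why a resume offset obtained from one brick is not meaningful on another.

```python
class Brick:
    """Toy brick: the same entries, but in a brick-specific order."""

    def __init__(self, entries):
        # entries: list of (name, cookie). The cookie is the offset a
        # caller passes back to resume *after* that entry on THIS brick.
        self.entries = entries

    def readdir(self, offset, count):
        start = 0
        if offset != 0:
            cookies = [cookie for _, cookie in self.entries]
            start = cookies.index(offset) + 1
        return self.entries[start:start + count]


def list_with_failover(first, second, batch):
    """Read one batch from `first`, then (wrongly) resume on `second`
    using the offset that `first` handed out."""
    names = []
    chunk = first.readdir(0, batch)
    names += [name for name, _ in chunk]
    last_cookie = chunk[-1][1]
    # Assumption for the demo: `first` goes down here, so the caller
    # naively continues on the replica with the stale offset.
    chunk = second.readdir(last_cookie, batch)
    names += [name for name, _ in chunk]
    return names


# Both replicas hold a, b, c, d but enumerate them in different orders.
brick_a = Brick([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
brick_b = Brick([("c", 1), ("a", 2), ("d", 3), ("b", 4)])

names = list_with_failover(brick_a, brick_b, batch=2)
print(names)
```

In this toy run the combined listing repeats one entry and never returns another, which is the kind of inconsistency that makes a mid-listing switch between bricks unsafe.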
(In reply to Susant Kumar Palai from comment #6)
> From rebalance log:
> [2018-04-06 11:52:54.484298] I [MSGID: 0]
> [dht-rebalance.c:3732:gf_defrag_fix_layout] 0-nithya: entry->name =
> vexpress-scc.txt
> [2018-04-06 11:52:54.484568] W [MSGID: 114061]
> [client-common.c:1197:client_pre_readdirp] 0-pingx3-client-30:
> (6072b1ff-c676-4c76-993c-cfad73e0a4f5) remote_fd is -1. EBADFD [File
> descriptor in bad state]
> [2018-04-06 11:52:54.485145] E [MSGID: 109058]
> [dht-rebalance.c:3715:gf_defrag_fix_layout] 0-pingx3-dht: readdirp failed
> for path /linux-4.4.36/Documentation/devicetree/bindings/arm. Aborting
> fix-layout [File descriptor in bad state]
>
>
> This is a known issue. The brick that was brought down was serving readdirp
> request. And this generally are not transferred to other afr children as
> there will be a offset mismatch between bricks.
>
> Moving this to AFR component for clarification on the same. Please move this
> back to DHT if you feel otherwise.
>

Yes, AFR has failover for readdirs only if the request is at offset 0. If a readdir cbk fails in the middle of a listing, it cannot be retried on a different brick.

Atin, is it okay to close this as WONTFIX if we feel that is the appropriate thing to do?
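The failover rule described above can be sketched roughly as follows. This is a simplified Python illustration, not the actual AFR implementation: `BrickDown`, `afr_style_readdir`, and the child callables are hypothetical names, and the only behaviour taken from this thread is that a retry on another child is allowed only at offset 0.

```python
class BrickDown(Exception):
    """Hypothetical error raised when a child (brick) cannot serve a request."""


def afr_style_readdir(children, offset):
    """Try the first child; fail over to the others only at offset 0.

    `children` is a list of callables taking an offset and returning a
    list of entry names, or raising BrickDown.
    """
    first, *rest = children
    try:
        return first(offset)
    except BrickDown:
        if offset != 0:
            # A non-zero offset only makes sense on the brick that
            # issued it, so the error is propagated to the caller.
            raise
    # Offset 0 means the listing starts from scratch, so any child works.
    for child in rest:
        try:
            return child(offset)
        except BrickDown:
            continue
    raise BrickDown("no child could serve the readdir")


def down(offset):
    raise BrickDown("brick is down")


def up(offset):
    return ["a", "b"]


print(afr_style_readdir([down, up], 0))
```

Under this policy a caller such as rebalance would have to restart the directory listing from offset 0 to make progress after a brick goes down mid-listing; the sketch does not attempt that.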