Description of problem:
=======================
Remove-brick rebalance on a node failed; the log says "Readdirp failed. Aborting data migration for directory: / [File descriptor in bad state]".

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-24.el7rhgs.x86_64
samba-4.6.3-0.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
===================
1) Create a distributed-replicate volume and start it.
2) Enable the nl-cache settings.
3) CIFS-mount the volume on multiple clients.
4) Perform the tasks below simultaneously from multiple clients:

Client-1:
Start creating 6000 files -
for i in {1..6000}; do cat /etc/redhat-release > f$i; done
Continuous lookup -
for i in {1..4000}; do ll f$i fl$i; done

Client-2:
Start creating hardlinks -
for i in {1..4000}; do ln f$i fl$i; done
Continuous lookup -
for i in {1..4000}; do ll f$i fl$i; done

5) While step 4 is in progress, start removing a few bricks from the volume. Keep monitoring the remove-brick status.

Rebalance logs:
===============
[2017-05-11 10:17:59.827740] I [MSGID: 109022] [dht-rebalance.c:1998:dht_migrate_file] 0-distrep-dht: completed migration of /f2577 from subvolume distrep-replicate-9 to distrep-replicate-3
[2017-05-11 10:17:59.832095] I [MSGID: 109069] [dht-common.c:1322:dht_lookup_unlink_cbk] 0-distrep-dht: lookup_unlink returned with op_ret -> 0 and op-errno -> 0 for /fl341
[2017-05-11 10:17:59.842942] W [MSGID: 114061] [client-common.c:1198:client_pre_readdirp] 0-distrep-client-6: (00000000-0000-0000-0000-000000000001) remote_fd is -1. EBADFD [File descriptor in bad state]
[2017-05-11 10:17:59.843917] W [MSGID: 109021] [dht-rebalance.c:2684:gf_defrag_get_entry] 0-distrep-dht: Readdirp failed. Aborting data migration for directory: / [File descriptor in bad state]
[2017-05-11 10:17:59.843984] W [dht-rebalance.c:3142:gf_defrag_process_dir] 0-distrep-dht: Found error from gf_defrag_get_entry
[2017-05-11 10:17:59.844404] E [MSGID: 109111] [dht-rebalance.c:3439:gf_defrag_fix_layout] 0-distrep-dht: gf_defrag_process_dir failed for directory: /
[2017-05-11 10:17:59.844614] I [dht-rebalance.c:2563:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 3
[2017-05-11 10:17:59.844745] I [dht-rebalance.c:2563:gf_defrag_task] 0-DHT: Thread wokeup. defrag->current_thread_count: 4

NOTE: Only smb packages are installed on the system and CTDB is **NOT** configured.

Actual results:
===============
Remove-brick rebalance on the node failed; the log says "Readdirp failed. Aborting data migration for directory: / [File descriptor in bad state]".

Expected results:
================
Rebalance should complete without any failures.
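For reference, a minimal sketch of the CLI side of steps 1, 2, 3 and 5 above. The volume name distrep is taken from the rebalance logs; the host names, brick paths, replica count/layout, and the Samba share name (gluster-distrep, the default created by the Samba hook scripts) are assumptions:

# 1) Create and start a 4x2 distributed-replicate volume (layout assumed)
gluster volume create distrep replica 2 \
    server{1..4}:/bricks/brick1/distrep server{1..4}:/bricks/brick2/distrep
gluster volume start distrep

# 2) Enable the nl-cache settings ('group nl-cache' applies the whole
#    option group on builds that ship the group file; otherwise set
#    performance.nl-cache and the related options individually)
gluster volume set distrep group nl-cache

# 3) CIFS-mount the volume on each client
mount -t cifs //server1/gluster-distrep /mnt/distrep -o username=smbuser

# 5) While the step-4 I/O is running, remove one replica pair and poll
gluster volume remove-brick distrep \
    server3:/bricks/brick2/distrep server4:/bricks/brick2/distrep start
gluster volume remove-brick distrep \
    server3:/bricks/brick2/distrep server4:/bricks/brick2/distrep status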
nl-cache mostly has nothing to do with readdirp or rebalance. Can you please try the same test after disabling nl-cache? And is this consistently seen only with nl-cache enabled?
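For the retest, turning it back off would be something like the following (volume name taken from the logs; reset any other nl-cache options that were set the same way):

gluster volume set distrep performance.nl-cache off
gluster volume reset distrep performance.nl-cache-timeout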
Need info on comment 3
Also, I see that parallel-readdir was not enabled on this volume. So this looks like neither an nl-cache nor a parallel-readdir bug.
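For reference, the option state can be confirmed with (volume name from the logs):

gluster volume get distrep performance.parallel-readdir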
Assigning this to myself while I check if this is a DHT issue. I am not changing the component just yet.
CCing Atin on this BZ as the crash looks like it may be related to brick mux.
From the core -->

(gdb) f 1
#1  0x00007f74e9995dfa in server_process_event_upcall (this=0x7f74ec0283d0, data=data@entry=0x7f74b8571ef0) at server.c:1353
1353            if (!client || strcmp(client->client_uid, client_uid))
(gdb) p client
$1 = <optimized out>
(gdb) p client
$2 = <optimized out>
(gdb) l
1348            list_for_each_entry (xprt, &conf->xprt_list, list) {
1349                client = xprt->xl_private;
1350
1351                /* 'client' is not atomically added during xprt entry
1352                 * addition to the list. */
1353                if (!client || strcmp(client->client_uid, client_uid))
1354                    continue;
1355
1356                ret = rpcsvc_request_submit (conf->rpc, xprt,
1357                                             &server_cbk_prog,
(gdb) p xprt
$3 = (rpc_transport_t *) 0x7f74ec2f8b90
(gdb) p xprt->xl_private
$4 = (struct _client_t *) 0x7f74c0009600
(gdb) p *$4
$5 = {scratch_ctx = {lock = {spinlock = -1073729472, mutex = {__data = {__lock = -1073729472, __count = 32628, __owner = -1073700176, __nusers = 32628, __kind = 0, __spins = 0, __list = {__prev = 0x7f74c00086b0, __next = 0x7f74c000a4d0}},
__size = "@0\000\300t\177\000\000\260\242\000\300t\177\000\000\000\000\000\000\000\000\000\000\260\206\000\300t\177\000\000ะค\000\300t\177\000", __align = 140139414171712}}, count = 45680, ctx = 0x7f74c000a3a0}, bind = {cnt = 140139414200704,
lk = {spinlock = -335454016, mutex = {__data = {__lock = -335454016, __count = 32628, __owner = -347170848, __nusers = 32628, __kind = -347171040, __spins = 32628, __list = {__prev = 0x7f74eb4e9780 <dumpops>, __next = 0x7f74c000a400}},
__size = "\300`\001\354t\177\000\000\340\227N\353t\177\000\000 \227N\353t\177\000\000\200\227N\353t\177\000\000\000\244\000\300t\177\000", __align = 140140152447168}}}, count = {cnt = 140139414201344, lk = {spinlock = -349325120,
mutex = {__data = {__lock = -349325120, __count = 32628, __owner = -349325760, __nusers = 32628, __kind = -349325824, __spins = 32628, __list = {__prev = 0x7f74eb2da660 <mem_acct_init>, __next = 0x7f74fe5c2970 <default_notify>}},
__size = "\300\270-\353t\177\000\000@\266-\353t\177\000\000\000\266-\353t\177\000\000`\246-\353t\177\000\000p)\\\376t\177\000", __align = 140140138576064}}}, bound_xl = 0x0, this = 0x0, tbl_index = 0, client_uid = 0x0,
auth = {flavour = 0, len = 0, data = 0x0, username = 0x0, passwd = 0x0}}
(gdb)

Upcalls are sent (by default) to all the clients connected to the brick process except the one actually performing the fop that triggered the upcall. Hence, in 'server_process_event_upcall', we scan through the existing transport objects and match each corresponding client's client_uid against that of the client which requested the fop. From the core it looks like the rpc transport object holds a reference to a client object that is either invalid or uninitialized (note client_uid = 0x0 in $5 above; the strcmp() at server.c:1353 then dereferences NULL and crashes).

Requesting Du to take a look and provide his comments.
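For illustration, a hedged sketch of the guard the loop above would need to survive this particular core. The client_t below is a hypothetical stand-in, not the real GlusterFS structure, and since the client object in the core is stale/freed memory rather than merely half-initialized, a check like this would only paper over the crash, not fix the underlying xprt lifetime problem:

#include <string.h>

/* Hypothetical stand-in for the real client_t. */
typedef struct client {
        char *client_uid;
} client_t;

/* Returns 1 only for a fully-initialized client whose uid matches.
 * Mirrors the test at server.c:1353 plus a NULL check on client_uid,
 * which is 0x0 in the core above, so the strcmp() dereferences NULL. */
static int
upcall_client_matches (client_t *client, const char *client_uid)
{
        if (!client || !client->client_uid)
                return 0;
        return strcmp (client->client_uid, client_uid) == 0;
}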
Looks like a DISCONNECT event on a transport reached the server even before the ACCEPT event was delivered, leaving a stale xprt object in conf->xprt_list. Hence this is likely a duplicate of bz 1442535 and is fixed by patch [1].

Note that patch [1] went into 3.8.4-25, while this bug is filed against 3.8.4-24.

[1] https://code.engineering.redhat.com/gerrit/105653
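To make the ordering concrete, here is a minimal, self-contained simulation of the race described above. All names (xprt_t, on_accept, on_disconnect) are hypothetical stand-ins for the server-protocol code, and this is only a sketch of the failure mode, not the actual patch:

#include <stdio.h>

/* Hypothetical stand-ins for rpc_transport_t and conf->xprt_list. */
typedef struct xprt {
        struct xprt *next;
} xprt_t;

static xprt_t *xprt_list;        /* stands in for conf->xprt_list */

static void
on_accept (xprt_t *x)
{
        /* ACCEPT links the transport into the server's list. */
        x->next = xprt_list;
        xprt_list = x;
}

static void
on_disconnect (xprt_t *x)
{
        /* DISCONNECT unlinks it -- a no-op if it was never linked. */
        for (xprt_t **p = &xprt_list; *p; p = &(*p)->next) {
                if (*p == x) {
                        *p = x->next;
                        return;
                }
        }
}

int
main (void)
{
        xprt_t x = { 0 };

        on_disconnect (&x);   /* DISCONNECT delivered first ...         */
        on_accept (&x);       /* ... so ACCEPT still links the already-
                               * dead transport, leaving a stale entry
                               * for later upcall scans to trip over.   */

        printf ("stale xprt on list: %s\n", xprt_list == &x ? "yes" : "no");
        return 0;
}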
Based on comment 17, moving this bug to ON_QA.
(In reply to Raghavendra G from comment #17)
> looks like a DISCONNECT event on a transport came to server even before an
> ACCEPT event was delivered, leading to a stale xprt object in
> conf->xprt_list. Hence likely a duplicate of bz 1442535 and fixed by patch
> [1]
>
> Note that the patch [1] went in 3.8.4-25, but this bug is filed under
> 3.8.4-24.
>
> [1] https://code.engineering.redhat.com/gerrit/105653

Thanks Raghavendra.
This issue is not reproduced on glusterfs version 3.8.4-32.el7rhgs.x86_64. Moving this BZ to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774