Description of problem:
========================
Writes are in progress on files from an NFS mount of a 2x2 distributed-replicate volume. After attaching a tier, all writes and ls are hung on the NFS mount.

Version-Release number of selected component (if applicable):
================================================================
glusterfs-3.7.5-19.el7rhgs.x86_64

How reproducible:
==================
Every time

Steps to Reproduce:
====================
1. Create a 2x2 distributed-replicate volume. Start the volume. Mount the volume (NFS mount).
2. Open fds on files and start writes on those fds.
3. Attach a tier (2x2) to the volume.
4. Try listing the files on the mount.
(See the command-level sketch after the additional info below.)

Actual results:
===============
Both writes and ls are hung.

Expected results:
==================
attach-tier shouldn't affect the I/O on the mount.

Additional info:
==================
[root@rhsauto019 ~]# gluster v info

Volume Name: testvol
Type: Tier
Volume ID: 52063226-a694-4352-8b84-774a6d6ad98b
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: rhsauto022.lab.eng.blr.redhat.com:/bricks/brick1/testvol_tier3
Brick2: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick1/testvol_tier2
Brick3: rhsauto020.lab.eng.blr.redhat.com:/bricks/brick1/testvol_tier1
Brick4: rhsauto019.lab.eng.blr.redhat.com:/bricks/brick1/testvol_tier0
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick5: rhsauto019.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick0
Brick6: rhsauto020.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick1
Brick7: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick2
Brick8: rhsauto022.lab.eng.blr.redhat.com:/bricks/brick0/testvol_brick3
Options Reconfigured:
cluster.watermark-hi: 40
cluster.watermark-low: 2
cluster.tier-mode: cache
features.ctr-enabled: on
performance.quick-read: off
performance.io-cache: off
features.uss: enable
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
[root@rhsauto019 ~]#

'ls' output before attaching the tier
=====================================
[root@skoduri ~]# ls -lh /mnt/nfs
total 244K
-rwxr-xr-x. 1 root root  54K Feb 12 13:16 file_dir_ops.sh
-rw-r--r--. 1 root root 6.0K Feb 12 13:18 testfile_1
-rw-r--r--. 1 root root 6.4K Feb 12 13:18 testfile_10
-rw-r--r--. 1 root root 6.2K Feb 12 13:18 testfile_11
-rw-r--r--. 1 root root 6.9K Feb 12 13:18 testfile_12
-rw-r--r--. 1 root root 6.9K Feb 12 13:18 testfile_13
-rw-r--r--. 1 root root 6.4K Feb 12 13:18 testfile_14
-rw-r--r--. 1 root root 5.9K Feb 12 13:18 testfile_15
-rw-r--r--. 1 root root 6.7K Feb 12 13:18 testfile_16
-rw-r--r--. 1 root root 6.2K Feb 12 13:18 testfile_17
-rw-r--r--. 1 root root 6.4K Feb 12 13:18 testfile_18
-rw-r--r--. 1 root root 5.5K Feb 12 13:18 testfile_19
-rw-r--r--. 1 root root 6.6K Feb 12 13:18 testfile_2
-rw-r--r--. 1 root root 6.6K Feb 12 13:18 testfile_20
-rw-r--r--. 1 root root 6.0K Feb 12 13:18 testfile_21
-rw-r--r--. 1 root root 6.0K Feb 12 13:18 testfile_22
-rw-r--r--. 1 root root 6.6K Feb 12 13:18 testfile_23
-rw-r--r--. 1 root root 6.2K Feb 12 13:18 testfile_24
-rw-r--r--. 1 root root 6.1K Feb 12 13:18 testfile_25
-rw-r--r--. 1 root root 6.5K Feb 12 13:18 testfile_26
-rw-r--r--. 1 root root 6.3K Feb 12 13:18 testfile_27
-rw-r--r--. 1 root root 6.0K Feb 12 13:18 testfile_28
-rw-r--r--. 1 root root 6.2K Feb 12 13:18 testfile_29
-rw-r--r--. 1 root root 6.2K Feb 12 13:18 testfile_3
-rw-r--r--. 1 root root 6.6K Feb 12 13:18 testfile_4
-rw-r--r--. 1 root root 6.6K Feb 12 13:18 testfile_5
-rw-r--r--. 1 root root 7.1K Feb 12 13:18 testfile_6
-rw-r--r--. 1 root root 6.5K Feb 12 13:18 testfile_7
-rw-r--r--. 1 root root 6.2K Feb 12 13:18 testfile_8
-rw-r--r--. 1 root root 6.2K Feb 12 13:18 testfile_9

After attach-tier, the files are not migrated to the hot tier:
==============================================================
[2016-02-12 07:49:23.794702] I [MSGID: 109038] [tier.c:170:tier_check_watermark] 0-testvol-tier-dht: Tier watermark now 1
[2016-02-12 08:40:07.217141] I [MSGID: 109028] [dht-rebalance.c:3619:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 3049.00 secs
[2016-02-12 08:40:07.217194] I [MSGID: 109028] [dht-rebalance.c:3623:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
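For reference, a rough command-level sketch of the reproduction steps above (hostnames, brick paths and mount options here are placeholders, not the exact ones from this setup):

    # 1. create and start a 2x2 dist-rep volume, then NFS-mount it (Gluster/NFS, NFSv3)
    gluster volume create testvol replica 2 \
        host1:/bricks/brick0/b0 host2:/bricks/brick0/b1 \
        host3:/bricks/brick0/b2 host4:/bricks/brick0/b3
    gluster volume start testvol
    mount -t nfs -o vers=3 host1:/testvol /mnt/nfs

    # 2. open fds on files under /mnt/nfs and keep writing to them

    # 3. attach a 2x2 hot tier while the writes are in progress
    gluster volume attach-tier testvol replica 2 \
        host1:/bricks/brick1/t0 host2:/bricks/brick1/t1 \
        host3:/bricks/brick1/t2 host4:/bricks/brick1/t3

    # 4. try listing the files on the mount
    ls -lh /mnt/nfs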
1. How reproducible is this problem? Does it happen every time you run this test?
2. Is there a distaf testcase that we can run to easily reproduce this problem?
3. Does it only happen when mounting over NFS, or also with FUSE, Samba, Ganesha?
4. Did you try to reproduce it upstream? Has it already been fixed there?
1) and 3) This happens every time we run this against an NFS mount. It does not happen on FUSE.
2) This is a distaf testcase and can be easily recreated. (But the testcase is not in the git repo yet.)
4) No, we haven't tried to reproduce it upstream.
When checked against the socket number, we found that fd 21 is opened for that socket by the gluster-nfs process. From the core:

(gdb) p ev_data
$4 = (struct event_thread_data *) 0x7ff5790c1b10
(gdb) p ev_data->event_indec
There is no member named event_indec.
(gdb) p ev_data->event_index
$5 = 1
(gdb) p event_pool->activethreadcount
$6 = 3
(gdb) p event_pool->eventthreadcount
$7 = 2
(gdb) p *event_pool
$8 = {ops = 0x7ff578ac8700 <event_ops_epoll>, fd = 3, breaker = {0, 0}, count = 16384, reg = 0x0,
  ereg = {0x7ff5790b1980, 0x0 <repeats 1023 times>}, slots_used = {18, 0 <repeats 1023 times>},
  used = 0, changed = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
      __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = '\000' <repeats 39 times>, __align = 0}, cond = {__data = {__lock = 0, __futex = 0,
      __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0,
      __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}, evcache = 0x0,
  evcache_size = 0, eventthreadcount = 2, pollers = {140692049266432, 140691903022848,
    0 <repeats 30 times>}, destroy = 0, activethreadcount = 3}
(gdb) p ctx
No symbol "ctx" in current context.
(gdb) p *event_pool->slots_used
$9 = 18
(gdb) p event_pool->slots_used
$10 = {18, 0 <repeats 1023 times>}
(gdb) p event_pool->ereg[0]
$11 = (struct event_slot_epoll *) 0x7ff5790b1980
(gdb) p event_pool->ereg[1]
$12 = (struct event_slot_epoll *) 0x0
(gdb) p *event_pool->ereg[0]
$13 = {fd = 7, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff57908f5d0, handler = 0x7ff56d4baaf0 <socket_server_event_handler>, lock = 1}
(gdb) p event_pool->slots_used[0]
$14 = 18
(gdb) p *event_pool->ereg[1]
Cannot access memory at address 0x0
(gdb) p *event_pool->ereg[2]
Cannot access memory at address 0x0
(gdb) p *event_pool->ereg[3]
Cannot access memory at address 0x0
(gdb) p $11
$15 = (struct event_slot_epoll *) 0x7ff5790b1980
(gdb) p $11[0]
$16 = {fd = 7, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff57908f5d0, handler = 0x7ff56d4baaf0 <socket_server_event_handler>, lock = 1}
(gdb) p $11[1]
$17 = {fd = 9, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff5790bf7b0, handler = 0x7ff56d4be240 <socket_event_handler>, lock = 1}
(gdb) p $11[2]
$18 = {fd = 11, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff56404a9f0, handler = 0x7ff56d4baaf0 <socket_server_event_handler>, lock = 1}
(gdb) p $11[3]
$19 = {fd = 12, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff56404be30, handler = 0x7ff56d4baaf0 <socket_server_event_handler>, lock = 1}
(gdb) p $11[4]
$20 = {fd = 13, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff56404d200, handler = 0x7ff56d4baaf0 <socket_server_event_handler>, lock = 1}
(gdb) p $11[5]
$21 = {fd = 14, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff56404e6a0, handler = 0x7ff56d4baaf0 <socket_server_event_handler>, lock = 1}
(gdb) p $11[6]
$22 = {fd = 15, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff56404fda0, handler = 0x7ff56d4baaf0 <socket_server_event_handler>, lock = 1}
(gdb) p $11[7]
$23 = {fd = 18, events = 1073741827, gen = 13, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff5641f5b90, handler = 0x7ff56d4be240 <socket_event_handler>, lock = 1}
(gdb) p $11[8]
$24 = {fd = 19, events = 1073741827, gen = 4, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff5641a8130, handler = 0x7ff56d4be240 <socket_event_handler>, lock = 1}
(gdb) p $11[9]
$25 = {fd = 20, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff564181400, handler = 0x7ff56d4be240 <socket_event_handler>, lock = 1}
(gdb) p $11[10]
$26 = {fd = 10, events = 1073741827, gen = 13, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff56407f7b0, handler = 0x7ff56d4be240 <socket_event_handler>, lock = 1}
(gdb) p $11[11]
$27 = {fd = 22, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff56415a6d0, handler = 0x7ff56d4be240 <socket_event_handler>, lock = 1}
(gdb) p $11[12]
$28 = {fd = 23, events = 1073741827, gen = 1, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff5641339a0, handler = 0x7ff56d4be240 <socket_event_handler>, lock = 1}
(gdb) p $11[13]
$29 = {fd = 21, events = 1073741826, gen = 4, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff564002140, handler = 0x7ff56d4be240 <socket_event_handler>, lock = 1}
(gdb) p event_pool->slots_used[13]
$30 = 0
(gdb) p event_pool->slots_used[0][13]
cannot subscript something of type `int'
(gdb) p event_pool->ereg[0][13]
$31 = {fd = 21, events = 1073741826, gen = 4, ref = 1, do_close = 0, in_handler = 0, data = 0x7ff564002140, handler = 0x7ff56d4be240 <socket_event_handler>, lock = 1}
(gdb) p event_pool->ereg[0][13]->events
$32 = 1073741826
(gdb) p *event_pool->ereg[0][13]->events
Cannot access memory at address 0x40000002
(gdb) p &event_pool->ereg[0][13]->events
$33 = (int *) 0x7ff5790b1bf4
(gdb) p/x event_pool->ereg[0][13]->events     <<<<<<<<<<<<< EPOLLIN is not enabled
$34 = 0x40000002
(gdb) p/u 1u << 30
$35 = 1073741824
(gdb) p/x 1u << 30
$36 = 0x40000000
(gdb) p/x ($34|0x001)
$37 = 0x40000003
(gdb) p/x event_pool->ereg[0][1]->events
$38 = 0x40000003
(gdb) p/x event_pool->ereg[0][0]->events
$39 = 0x40000003
(gdb) p/x event_pool->ereg[0][2]->events
$40 = 0x40000003
(gdb) p/x event_pool->ereg[0][3]->events
$41 = 0x40000003
(gdb) p/x event_pool->ereg[0][4]->events
$42 = 0x40000003
(gdb) p/x event_pool->ereg[0][5]->events
$43 = 0x40000003
(gdb) p/x event_pool->ereg[0][6]->events
$44 = 0x40000003
(gdb) p/x event_pool->ereg[0][7]->events
$45 = 0x40000003
(gdb) p/x event_pool->ereg[0][8]->events
$46 = 0x40000003
(gdb) p/x event_pool->ereg[0][9]->events
$47 = 0x40000003
(gdb) p/x event_pool->ereg[0][10]->events
$48 = 0x40000003
(gdb) p/x event_pool->ereg[0][11]->events
$49 = 0x40000003
(gdb) p/x event_pool->ereg[0][12]->events
$50 = 0x40000003
(gdb) p/x event_pool->ereg[0][13]->events
$51 = 0x40000002
(gdb) p/x event_pool->ereg[0][14]->events
$52 = 0x40000003

As suspected, there is no EPOLLIN event enabled for this particular fd. We need to analyse the possible scenarios that could lead to this state.
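To make the bitmask comparison at the end of the gdb session explicit: with the standard Linux epoll flag values, 0x40000003 decodes to EPOLLONESHOT | EPOLLPRI | EPOLLIN, while the slot for fd 21 shows 0x40000002, i.e. the same mask without EPOLLIN. A small standalone decode, only as an illustration (the masks are copied from the dump above):

    #include <stdio.h>
    #include <sys/epoll.h>

    int main (void)
    {
            /* event masks copied from the core dump above */
            unsigned int healthy_slot = 0x40000003;  /* e.g. ereg[0][12] */
            unsigned int hung_slot    = 0x40000002;  /* ereg[0][13], fd 21 */

            printf ("EPOLLIN=0x%x EPOLLPRI=0x%x EPOLLONESHOT=0x%x\n",
                    EPOLLIN, EPOLLPRI, EPOLLONESHOT);
            printf ("healthy slot: EPOLLIN %s\n",
                    (healthy_slot & EPOLLIN) ? "set" : "missing");
            printf ("fd 21 slot  : EPOLLIN %s\n",
                    (hung_slot & EPOLLIN) ? "set" : "missing");
            return 0;
    }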
Created attachment 1127291 [details] nfs-hang-from-server.pcap
If the request is served from the DRC, we never decrement the outstanding request count, because rpcsvc_request_destroy is missing on that path. This causes the outstanding request count to end up permanently above the outstanding-request limit, resulting in this issue. I'll send a patch to fix it.

        /* DRC */
        if (rpcsvc_need_drc (req)) {
                drc = req->svc->drc;

                LOCK (&drc->lock);
                {
                        reply = rpcsvc_drc_lookup (req);

                        /* retransmission of completed request, send cached reply */
                        if (reply && reply->state == DRC_OP_CACHED) {
                                gf_log (GF_RPCSVC, GF_LOG_INFO, "duplicate request:"
                                        " XID: 0x%x", req->xid);
                                ret = rpcsvc_send_cached_reply (req, reply);
                                drc->cache_hits++;
                                UNLOCK (&drc->lock);
>>>>>>>>>>>>>>>                 we don't destroy the request here, causing a request leak
                                and making the outstanding request count off by 1
                                (for every request served from the DRC).
                                goto out;

                        /* retransmitted request, original op in transit, drop it */
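A minimal sketch of the fix proposed above, only for illustration: it assumes rpcsvc_request_destroy() is the right helper to release the request on the cache-hit path and that it is safe to call after the cached reply has been sent (the placement is hypothetical, not the actual patch):

                        /* retransmission of a completed request: send the cached reply */
                        if (reply && reply->state == DRC_OP_CACHED) {
                                gf_log (GF_RPCSVC, GF_LOG_INFO, "duplicate request:"
                                        " XID: 0x%x", req->xid);
                                ret = rpcsvc_send_cached_reply (req, reply);
                                drc->cache_hits++;
                                UNLOCK (&drc->lock);
                                /* hypothetical fix: release the request so the
                                 * outstanding request count is decremented again */
                                rpcsvc_request_destroy (req);
                                goto out;
                        }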
I thought DRC is disabled by default. Is that only for Gluster/NFS and not for the Gluster client? If it really is disabled, we should not be able to reach this piece of code...
I am not so familiar with NFS. Soumya should be able to answer this.
As Niels mentioned, DRC seems to be disabled by default, and hence it's unlikely we hit the above-mentioned code path.
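For completeness, whether DRC is active for Gluster/NFS can be checked or toggled via a volume option; the option names below (nfs.drc, nfs.drc-size) are an assumption for this build and should be verified against the installed version:

    # show the effective value (it is not listed in 'gluster v info' unless reconfigured)
    gluster volume get testvol nfs.drc

    # explicitly disable it
    gluster volume set testvol nfs.drc off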
(In reply to Soumya Koduri from comment #10)
> As Niels mentioned, DRC seems to be disabled by default, and hence it's
> unlikely we hit the above-mentioned code path.

In that case, my analysis is invalid.
This issue is seen with the following workload too:

1) Create a 4x2 dist-rep cold tier.
2) Perform an NFS mount of this volume and create larger files using dd:
   for i in {1..10};do dd if=/dev/urandom of=file$i bs=1024 count=700000;echo $?;done
3) Attach a tier (2x2) and allow attach-tier to complete.
4) Detach the tier and allow detach-tier to complete.
5) Attach a tier once again with a different or cleaned-up brick path.
(A rough command sketch of steps 3-5 follows below.)

I/O hangs while writing 'file-3'.

sosreports from all 4 nodes and the client are available here --> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1306930/
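A rough command sketch of steps 3-5 above (brick paths are placeholders, and the exact status sub-commands may differ between builds):

    # 3. attach a 2x2 hot tier and wait for the attach-tier rebalance to complete
    gluster volume attach-tier testvol replica 2 <hot-brick1> <hot-brick2> <hot-brick3> <hot-brick4>

    # 4. detach the tier and commit once the demotion has finished
    gluster volume detach-tier testvol start
    gluster volume detach-tier testvol status
    gluster volume detach-tier testvol commit

    # 5. attach once again, with different or cleaned-up brick paths
    gluster volume attach-tier testvol replica 2 <new-hot-brick1> <new-hot-brick2> <new-hot-brick3> <new-hot-brick4>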