Description of problem:
-----------------------

I had two 4-node clusters, with one Distributed-Replicate volume each. I was doing add-brick on one (with a change in replica count from 2 to 3) and remove-brick on the other. I/O was hung on 2 of my clients in both setups during rebalance for nearly 14 hours, and it continues to be hung now, even though rebalance is over.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

glusterfs-3.8.4-15.el7rhgs.x86_64
nfs-ganesha-2.4.1-7.el7rhgs.x86_64

How reproducible:
------------------

1/1

Steps to Reproduce:
------------------

1. Mount the volume via v4.
2. Run Bonnie/kernel untars.
3. Add/remove a brick (a rough command sketch follows).
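The steps above, expressed as commands. This is only an illustrative sketch: the VIP, mount point, run directory and new brick paths are hypothetical, not the exact ones from this run.

# client: mount the volume over NFSv4 through the nfs-ganesha VIP
mount -t nfs -o vers=4 10.70.40.100:/butcher /gluster-mount

# client: start the workload
bonnie++ -u root -d /gluster-mount/run1/

# server: while I/O is running, add one brick per existing replica pair,
# raising the replica count from 2 to 3, then start the rebalance
gluster volume add-brick butcher replica 3 \
    node1:/bricks5/b1 node2:/bricks5/b1 node3:/bricks5/b1 node4:/bricks5/b1
gluster volume rebalance butcher start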
Actual results:
----------------

I/O hangs indefinitely.

Expected results:
-----------------

I/O completes with zero errors on the clients during and after rebalance.

Additional info:
----------------

CLUSTER 1 (ADD BRICK):

[root@gqas009 ~]# gluster v info

Volume Name: butcher
Type: Distributed-Replicate
Volume ID: b14eca7f-a1a2-4c6e-8380-355652ecf7fa
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/b1
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/b1
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/b1
Brick4: gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/b1
Brick5: gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/b1
Brick6: gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/b1
Brick7: gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/b1
Brick8: gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/b1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/b1
Brick10: gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/b1
Brick11: gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/b1
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/b1
Options Reconfigured:
ganesha.enable: on
features.uss: enable
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.cache-samba-metadata: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

[root@gqas009 ~]# gluster v status
Status of volume: butcher
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/b1    49152     0          Y       28581
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/b1    49153     0          Y       29496
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/b1    49157     0          Y       1396
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/b1    49153     0          Y       28601
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/b1    49154     0          Y       29516
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/b1    49158     0          Y       2439
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/b1    49154     0          Y       28621
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/b1    49155     0          Y       29536
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/b1    49154     0          Y       702
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/b1    49155     0          Y       28641
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/b1    49156     0          Y       29556
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/b1    49154     0          Y       2015
Snapshot Daemon on localhost                            49157     0          Y       30595
Self-heal Daemon on localhost                           N/A       N/A        Y       23256
Quota Daemon on localhost                               N/A       N/A        Y       23264
Snapshot Daemon on gqas014.sbu.lab.eng.bos.redhat.com   49153     0          Y       29299
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       17428
Quota Daemon on gqas014.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       17436
Snapshot Daemon on gqas015.sbu.lab.eng.bos.redhat.com   49153     0          Y       30507
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       19700
Quota Daemon on gqas015.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       19708
Snapshot Daemon on gqas010.sbu.lab.eng.bos.redhat.com   49156     0          Y       29606
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       20404
Quota Daemon on gqas010.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       20412

Task Status of Volume butcher
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : d065363f-a3f8-4256-9d42-ef165e46e250
Status               : completed

CLUSTER 2 (REMOVE BRICK):

[root@gqas013 ~]# gluster v info

Volume Name: butcher
Type: Distributed-Replicate
Volume ID: 0ff07f58-4097-4a99-9a5c-545851678f7d
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/b1
Brick2: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/b1
Brick3: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/b1
Brick4: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/b1
Brick5: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/b1
Brick6: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/b1
Brick7: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/b1
Brick8: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/b1
Brick9: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/b1
Options Reconfigured:
ganesha.enable: on
features.uss: enable
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.cache-samba-metadata: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

[root@gqas013 ~]# gluster v status
Status of volume: butcher
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/b1    49153     0          Y       1532
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/b1    49153     0          Y       1538
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/b1    49153     0          Y       2938
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/b1    49154     0          Y       1554
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/b1    49154     0          Y       1558
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/b1    49154     0          Y       2958
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/b1    49155     0          Y       1574
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/b1    49155     0          Y       1578
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/b1    49155     0          Y       2978
Snapshot Daemon on localhost                            49156     0          Y       4096
Self-heal Daemon on localhost                           N/A       N/A        Y       27036
Quota Daemon on localhost                               N/A       N/A        Y       27044
Snapshot Daemon on gqas008.sbu.lab.eng.bos.redhat.com   49152     0          Y       2553
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       21030
Quota Daemon on gqas008.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       21138
Snapshot Daemon on gqas006.sbu.lab.eng.bos.redhat.com   49156     0          Y       2693
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       26922
Quota Daemon on gqas006.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       26930
Snapshot Daemon on gqas005.sbu.lab.eng.bos.redhat.com   49156     0          Y       2667
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com  N/A       N/A        Y       26059
Quota Daemon on gqas005.sbu.lab.eng.bos.redhat.com      N/A       N/A        Y       26072

Task Status of Volume butcher
------------------------------------------------------------------------------
Task                 : Remove brick
ID                   : 11222ad2-3492-499e-95a6-57c41b9bfa69
Removed bricks:
gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/b1
gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/b1
gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/b1
Status               : completed
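For completeness, the "Task Status" sections above are the end state of operations that can also be polled from the CLI while they run; on this volume that would look roughly like the following (standard gluster commands, shown here only for context):

gluster volume rebalance butcher status

gluster volume remove-brick butcher \
    gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/b1 \
    gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/b1 \
    gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/b1 status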
Workload Description: Bonnie on 3 clients, kernel untar on one.
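Per client that roughly translates to the following (the run directory is hypothetical; the tarball name is the one visible in the mount listing later in this bug):

# the three Bonnie clients
bonnie++ -u root -d /gluster-mount/run1/

# the kernel-untar client
tar -xJf /gluster-mount/linux-4.9.12.tar.xz -C /gluster-mount/d1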
I've reproduced this issue consistently on (command sketches for the latter two are below):

> Adding a brick and changing the replica count
> Adding a brick without changing the replica count
> Remove-brick

Proposing this to block RHGS 3.2, to be taken up for consideration in the next blocker triage.
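For reference, the two variants not covered by the earlier sketch. Brick paths are hypothetical; on a replicated volume bricks are added/removed in multiples of the replica count:

# add-brick without changing the replica count: adds a whole new replica set
gluster volume add-brick butcher \
    node1:/bricks7/b1 node2:/bricks7/b1 node3:/bricks7/b1
gluster volume rebalance butcher start

# remove-brick: starts an implicit rebalance that drains the listed set
gluster volume remove-brick butcher \
    node1:/bricks1/b1 node2:/bricks1/b1 node3:/bricks1/b1 start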
The tcpdumps were collected a few hours after the hang was observed, and they contain only RENEW fops. pstack on the server process also didn't show any threads waiting or hung. Moreover, other clients are able to access the same volume via the same server.

Looking at client logs -

Feb 27 23:00:39 gqac015 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 27 23:00:39 gqac015 kernel: bonnie++        D ffffffff816895f0     0 11340  11338 0x00000080
Feb 27 23:00:39 gqac015 kernel: ffff880036477c30 0000000000000086 ffff880c1732ce70 ffff880036477fd8
Feb 27 23:00:39 gqac015 kernel: ffff880036477fd8 ffff880036477fd8 ffff880c1732ce70 ffff880617a56c40
Feb 27 23:00:39 gqac015 kernel: 0000000000000000 7fffffffffffffff ffff88062ff95de8 ffffffff816895f0
Feb 27 23:00:39 gqac015 kernel: Call Trace:
Feb 27 23:00:39 gqac015 kernel: [<ffffffff816895f0>] ? bit_wait+0x50/0x50
Feb 27 23:00:39 gqac015 kernel: [<ffffffff8168b579>] schedule+0x29/0x70
Feb 27 23:00:39 gqac015 kernel: [<ffffffff81688fc9>] schedule_timeout+0x239/0x2d0
Feb 27 23:00:39 gqac015 kernel: [<ffffffffa066d124>] ? nfs_initiate_pgio+0xd4/0x160 [nfs]
Feb 27 23:00:39 gqac015 kernel: [<ffffffff810eaa6c>] ? ktime_get_ts64+0x4c/0xf0
Feb 27 23:00:39 gqac015 kernel: [<ffffffff816895f0>] ? bit_wait+0x50/0x50
Feb 27 23:00:39 gqac015 kernel: [<ffffffff8168ab1e>] io_schedule_timeout+0xae/0x130
Feb 27 23:00:39 gqac015 kernel: [<ffffffff8168abb8>] io_schedule+0x18/0x20
Feb 27 23:00:39 gqac015 kernel: [<ffffffff81689601>] bit_wait_io+0x11/0x50
Feb 27 23:00:39 gqac015 kernel: [<ffffffff81689125>] __wait_on_bit+0x65/0x90
Feb 27 23:00:39 gqac015 kernel: [<ffffffff8117fb61>] wait_on_page_bit+0x81/0xa0
Feb 27 23:00:39 gqac015 kernel: [<ffffffff810b16c0>] ? wake_bit_function+0x40/0x40
Feb 27 23:00:39 gqac015 kernel: [<ffffffff8117fc91>] filemap_fdatawait_range+0x111/0x1b0
Feb 27 23:00:39 gqac015 kernel: [<ffffffff81181f96>] filemap_write_and_wait_range+0x56/0x90
Feb 27 23:00:39 gqac015 kernel: [<ffffffffa06bca71>] nfs4_file_fsync+0x81/0x150 [nfsv4]
Feb 27 23:00:39 gqac015 kernel: [<ffffffff811fe1e2>] ? vfs_write+0x172/0x1e0
Feb 27 23:00:39 gqac015 kernel: [<ffffffff8122fdb5>] do_fsync+0x65/0xa0
Feb 27 23:00:39 gqac015 kernel: [<ffffffff81230080>] SyS_fsync+0x10/0x20
Feb 27 23:00:39 gqac015 kernel: [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b

The NFS client may have hung. Ben (cced) from the NFS-client team mentioned that recent v4 clients will wait forever if the server drops, or fails to respond to, a WRITE. Probably in this case too the server did not respond to, or dropped the response to, a WRITE request.
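For reference, data of the kind described above can be gathered with commands along these lines (the exact invocations used were not recorded in this bug; the PID and pcap path are hypothetical):

# server: capture NFS traffic and dump nfs-ganesha thread stacks
tcpdump -i any -s 0 port 2049 -w /tmp/nfs-hang.pcap
pstack $(pgrep ganesha.nfsd)

# client: dump kernel stacks of all blocked (D-state) tasks to dmesg,
# then read the stuck process's kernel stack directly
echo w > /proc/sysrq-trigger
cat /proc/11340/stack          # 11340 = the hung bonnie++ PID above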
The warnings below were observed on the server machine, but I cannot be sure whether they are related -

27/02/2017 11:09:27 : epoch 48620000 : gqas013.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-22354[work-208] nfs_dupreq_finish :DUPREQ :WARN :DRC retire entries: unable to find reclaimable dupreq LRU entry after 5 tries on DRC=0x7fda2406ce40, drc->size=1025
27/02/2017 11:09:27 : epoch 48620000 : gqas013.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-22354[work-202] nfs_dupreq_finish :DUPREQ :WARN :DRC retire entries: unable to find reclaimable dupreq LRU entry after 5 tries on DRC=0x7fda2406ce40, drc->size=1026
27/02/2017 11:09:27 : epoch 48620000 : gqas013.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-22354[work-75] nfs_dupreq_finish :DUPREQ :WARN :DRC retire entries: unable to find reclaimable dupreq LRU entry after 5 tries on DRC=0x7fda2406ce40, drc->size=1124
27/02/2017 11:09:27 : epoch 48620000 : gqas013.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-22354[work-198] nfs_dupreq_finish :DUPREQ :WARN :DRC retire entries: unable to find reclaimable dupreq LRU entry after 5 tries on DRC=0x7fda2406ce40, drc->size=1125
27/02/2017 11:10:00 : epoch 48620000 : gqas013.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-22354[work-152] nfs_dupreq_finish :DUPREQ :WARN :DRC retire entries: unable to find reclaimable dupreq LRU entry after 5 tries on DRC=0x7fda2406ce40, drc->size=1136

@Frank/Dan/Matt, do you infer anything from the above warnings that could have caused the server not to respond to the client? Please note that many of the drc fixes which went into the upstream next branch are missing from the current downstream codebase (which is at 2.4.1).

Also, when Ambarish reproduced the issue today, the client was hung for a long time (almost half a day) but seems to have recovered now (not sure how; it could be a network restart), yet the bonnie++ test suite still appears to be hung. Could there be any issue with the bonnie test suite itself?

[root@gqac015 ~]# ls /gluster-mount/
d1  dir1  dir4                 new      run11309  run1352  run2753  run3666  run5004
d2  dir2  dir5                 run1011  run11315  run1477  run3030  run3739  run5177
d3  dir3  linux-4.9.12.tar.xz  run1108  run1201   run1672  run3391  run4897  T

[root@gqac015 ~]# ps aux | grep bonnie
root     11315  0.0  0.0 113260  1568 pts/1  S+  Feb27  0:00 /bin/bash /opt/qa/tools/system_light/run.sh -w /gluster-mount -t bonnie -l /var/tmp/bonnie.log
root     11338  0.0  0.0 113128  1428 pts/1  S+  Feb27  0:00 /bin/bash /opt/qa/tools/system_light/scripts/bonnie/bonnie.sh
root     11340  0.1  0.0  16960  1380 pts/1  D+  Feb27  0:58 bonnie++ -u root -d /gluster-mount/run11315/
root     16791  0.0  0.0 112656   964 pts/3  S+  06:59  0:00 grep --color=auto bonnie
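For context, the DRC flagged in these warnings is nfs-ganesha's TCP duplicate-request cache, sized via NFS_CORE_PARAM in ganesha.conf. A sketch of the relevant knobs, with what I believe are the upstream 2.4 defaults (worth verifying against the shipped sources):

NFS_CORE_PARAM {
    # TCP DRC size; drc->size in the warnings above has grown past this
    DRC_TCP_Size = 1024;
    # number of DRC partitions and the retire/recycle high-water mark
    DRC_TCP_Npart = 1;
    DRC_TCP_Hiwat = 64;
    # set true to bypass the DRC entirely (diagnostic only; weakens
    # duplicate-request protection)
    DRC_Disabled = false;
}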
I'm not sure whether the warning has anything to do with the hang, but I suspect it may. Regardless, 1100 is a *lot* of dupreq entries, so something is wrong with the dupreq cache. There have been a lot of fixes recently relating to dupreq refcounting and races, mostly done by Malahal. Since this is reproducible, trying a backport of those fixes may be a good idea. I haven't gone through all of them to see which ones might be necessary, but here are all of the potentially relevant ones, from oldest to newest:

223d1304dc920da574bfa4e203a19f602ae22c07
8abb27f582f67d018989a9033bd17697fdae85bc
44f9a0f56aaa258d2221729e878e7479342aa70f
e3e88ab565d906b3283c3655985c22f83ced0ac2
9da22db00c28edc0e56fb6fbc105d5b82ea4cc71
269f0c3c23b4c0b82b18a9b5a5af466111cdbc10 (just changes log messages)
1e8e3158a57f6631e4b581bf8e4016b9989d2d17
d2acc676428e2cb504410799fb5eff920a3c1277
9cf5429fda1bfd07d98c8807d51754aadb26d8a0
b190d195279158e59867738904697ff125bce0b2
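Since these are ordinary commits on the upstream tree, a backport attempt could look something like the sketch below (the branch name is hypothetical and conflicts are likely, given how far 2.4.1 is behind next):

git clone https://github.com/nfs-ganesha/nfs-ganesha.git
cd nfs-ganesha
git checkout -b dupreq-backports V2.4.1
for c in \
    223d1304dc920da574bfa4e203a19f602ae22c07 \
    8abb27f582f67d018989a9033bd17697fdae85bc \
    44f9a0f56aaa258d2221729e878e7479342aa70f \
    e3e88ab565d906b3283c3655985c22f83ced0ac2 \
    9da22db00c28edc0e56fb6fbc105d5b82ea4cc71 \
    269f0c3c23b4c0b82b18a9b5a5af466111cdbc10 \
    1e8e3158a57f6631e4b581bf8e4016b9989d2d17 \
    d2acc676428e2cb504410799fb5eff920a3c1277 \
    9cf5429fda1bfd07d98c8807d51754aadb26d8a0 \
    b190d195279158e59867738904697ff125bce0b2
do
    git cherry-pick -x "$c" || break   # stop at the first conflict for manual resolution
done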
retest after rebase to 2.5.x
POST with rebase to nfs-ganesha-2.5.x
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610