+++ This bug was initially created as a clone of Bug #1499004 +++
+++ This bug was initially created as a clone of Bug #1425681 +++

Description of problem:
=======================
Had a tiered volume with 2*(4+2) as cold tier and plain distribute 1*4 as hot tier in a 6-node cluster. Had I/O taking place from fuse as well as nfs mounts, (of course) in different directories.

While testing watermarks on the volume, reduced the values of the low and high watermarks, which resulted in the data percentage of the hot tier exceeding the high watermark - which should result in demotions (only). Was monitoring the demotions taking place via the command 'gluster volume tier <volname> status'.

After a while, the said command started failing with 'Another transaction is in progress for <volname>. Please try again after sometime', and it has been stuck in that state for a day now. Glusterd logs complain of 'another lock being held by <uuid>'.

(Do not think it is related, but FYI) While monitoring the demotions with 'gluster volume tier <volname> status' and waiting for them to complete, I created a new 2*2 dist-rep volume and set 'nfs.disable' to 'off'. Soon after that, when I repeated the 'tier status' command, it started failing with '...another transaction is in progress...'.

'glusterd restart' (as advised by Atin) on the node which had held the lock seems to have brought the volume back to normal.

Sosreports at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-14

How reproducible:
=================
1:1

Additional info:
================
[root@dhcp46-221 ~]# gluster peer stauts
unrecognized word: stauts (position 1)
[root@dhcp46-221 ~]# gluster peer status
Number of Peers: 5

Hostname: dhcp46-242.lab.eng.blr.redhat.com
Uuid: 838465bf-1fd8-4f85-8599-dbc8367539aa
State: Peer in Cluster (Connected)
Other names:
10.70.46.242

Hostname: 10.70.46.239
Uuid: b9af0965-ffe7-4827-b610-2380a8fa810b
State: Peer in Cluster (Connected)

Hostname: 10.70.46.240
Uuid: 5bff39d7-cd9c-4dbb-86eb-2a7ba6dfea3d
State: Peer in Cluster (Connected)

Hostname: 10.70.46.218
Uuid: c2fbc432-b7a9-4db1-9b9d-a8d82e998923
State: Peer in Cluster (Connected)

Hostname: 10.70.46.222
Uuid: 81184471-cbf7-47aa-ba41-21f32bb644b0
State: Peer in Cluster (Connected)
[root@dhcp46-221 ~]# vim /var/log/glusterfs/glusterd.log
[root@dhcp46-221 ~]# gluster v status
Another transaction is in progress for ozone. Please try again after sometime.

Status of volume: vola
Gluster process                                  TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.239:/bricks/brick3/vola_0         49152     0          Y       5259
Brick 10.70.46.240:/bricks/brick3/vola_1         49152     0          Y       20012
Brick 10.70.46.242:/bricks/brick3/vola_2         49153     0          Y       21512
Brick 10.70.46.218:/bricks/brick3/vola_3         49155     0          Y       28705
NFS Server on localhost                          2049      0          Y       31911
Self-heal Daemon on localhost                    N/A       N/A        Y       31743
NFS Server on dhcp46-242.lab.eng.blr.redhat.com  2049      0          Y       21788
Self-heal Daemon on dhcp46-242.lab.eng.blr.redhat.com  N/A  N/A       Y       21563
NFS Server on 10.70.46.239                       2049      0          Y       5699
Self-heal Daemon on 10.70.46.239                 N/A       N/A        Y       5291
NFS Server on 10.70.46.218                       2049      0          Y       28899
Self-heal Daemon on 10.70.46.218                 N/A       N/A        Y       28759
NFS Server on 10.70.46.240                       2049      0          Y       20201
Self-heal Daemon on 10.70.46.240                 N/A       N/A        Y       20061
NFS Server on 10.70.46.222                       2049      0          Y       1784
Self-heal Daemon on 10.70.46.222                 N/A       N/A        Y       1588

Task Status of Volume vola
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp46-221 ~]# rpm -qa | grep gluster
glusterfs-libs-3.8.4-14.el7rhgs.x86_64
glusterfs-fuse-3.8.4-14.el7rhgs.x86_64
glusterfs-rdma-3.8.4-14.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-14.el7rhgs.x86_64
glusterfs-cli-3.8.4-14.el7rhgs.x86_64
glusterfs-events-3.8.4-14.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-server-3.8.4-14.el7rhgs.x86_64
python-gluster-3.8.4-14.el7rhgs.noarch
glusterfs-geo-replication-3.8.4-14.el7rhgs.x86_64
glusterfs-3.8.4-14.el7rhgs.x86_64
glusterfs-api-3.8.4-14.el7rhgs.x86_64
[root@dhcp46-221 ~]#

########## after glusterd restart ################

[root@dhcp46-221 ~]# gluster v status ozone
Status of volume: ozone
Gluster process                                  TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.46.222:/bricks/brick2/ozone_tier3    49152     0          Y       18409
Brick 10.70.46.221:/bricks/brick2/ozone_tier2    49152     0          Y       16208
Brick 10.70.46.218:/bricks/brick2/ozone_tier1    49152     0          Y       1655
Brick 10.70.46.242:/bricks/brick2/ozone_tier0    49152     0          Y       23869
Cold Bricks:
Brick 10.70.46.239:/bricks/brick0/ozone_0        49156     0          Y       18567
Brick 10.70.46.240:/bricks/brick0/ozone_1        49156     0          Y       21626
Brick 10.70.46.242:/bricks/brick0/ozone_2        49156     0          Y       10841
Brick 10.70.46.218:/bricks/brick0/ozone_3        49153     0          Y       27354
Brick 10.70.46.221:/bricks/brick0/ozone_4        49154     0          Y       2139
Brick 10.70.46.222:/bricks/brick0/ozone_5        49154     0          Y       4378
Brick 10.70.46.239:/bricks/brick1/ozone_6        49157     0          Y       18587
Brick 10.70.46.240:/bricks/brick1/ozone_7        49157     0          Y       21646
Brick 10.70.46.242:/bricks/brick1/ozone_8        49157     0          Y       10861
Brick 10.70.46.218:/bricks/brick1/ozone_9        49154     0          Y       27353
Brick 10.70.46.221:/bricks/brick1/ozone_10       49155     0          Y       2159
Brick 10.70.46.222:/bricks/brick1/ozone_11       49155     0          Y       4398
NFS Server on localhost                          2049      0          Y       5622
Self-heal Daemon on localhost                    N/A       N/A        Y       5630
Quota Daemon on localhost                        N/A       N/A        Y       5639
NFS Server on 10.70.46.239                       2049      0          Y       15129
Self-heal Daemon on 10.70.46.239                 N/A       N/A        Y       15152
Quota Daemon on 10.70.46.239                     N/A       N/A        Y       15189
NFS Server on 10.70.46.240                       2049      0          Y       25626
Self-heal Daemon on 10.70.46.240                 N/A       N/A        Y       25647
Quota Daemon on 10.70.46.240                     N/A       N/A        Y       25657
NFS Server on dhcp46-242.lab.eng.blr.redhat.com  2049      0          Y       20513
Self-heal Daemon on dhcp46-242.lab.eng.blr.redhat.com  N/A  N/A       Y       20540
Quota Daemon on dhcp46-242.lab.eng.blr.redhat.com  N/A     N/A        Y       20565
NFS Server on 10.70.46.222                       2049      0          Y       6509
Self-heal Daemon on 10.70.46.222                 N/A       N/A        Y       6532
Quota Daemon on 10.70.46.222                     N/A       N/A        Y       6549
NFS Server on 10.70.46.218                       2049      0          Y       11094
Self-heal Daemon on 10.70.46.218                 N/A       N/A        Y       11120
Quota Daemon on 10.70.46.218                     N/A       N/A        Y       11143

Task Status of Volume ozone
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : 19fb4787-d9de-4436-8f15-86ff39fbc7bb
Status               : in progress

[root@dhcp46-221 ~]# gluster v tier ozone status
Node                 Promoted files       Demoted files        Status
---------            ---------            ---------            ---------
localhost            0                    2033                 in progress
dhcp46-242.lab.eng.blr.redhat.com
                     0                    2025                 in progress
10.70.46.239         14                   0                    in progress
10.70.46.240         0                    0                    in progress
10.70.46.218         0                    2238                 in progress
10.70.46.222         0                    2167                 in progress
Tiering Migration Functionality: ozone: success
[root@dhcp46-221 ~]#

--- Additional comment from Worker Ant on 2017-10-05 15:08:42 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in mgmt_v3_lock) posted (#2) for review on master by Gaurav Yadav (gyadav)

--- Additional comment from Worker Ant on 2017-10-05 23:22:30 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in mgmt_v3_lock) posted (#3) for review on master by Gaurav Yadav (gyadav)

--- Additional comment from Worker Ant on 2017-10-06 03:32:47 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in mgmt_v3_lock) posted (#4) for review on master by Gaurav Yadav (gyadav)

--- Additional comment from Worker Ant on 2017-10-06 10:44:57 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in mgmt_v3_lock) posted (#5) for review on master by Gaurav Yadav (gyadav)

--- Additional comment from Worker Ant on 2017-10-10 12:29:10 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in mgmt_v3_lock) posted (#6) for review on master by Gaurav Yadav (gyadav)

--- Additional comment from Worker Ant on 2017-10-10 13:39:02 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in mgmt_v3_lock) posted (#7) for review on master by Gaurav Yadav (gyadav)

--- Additional comment from Worker Ant on 2017-10-10 21:51:58 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in mgmt_v3_lock) posted (#8) for review on master by Gaurav Yadav (gyadav)

--- Additional comment from Worker Ant on 2017-10-12 01:32:14 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in mgmt_v3_lock) posted (#9) for review on master by Gaurav Yadav (gyadav)

--- Additional comment from Worker Ant on 2017-10-17 00:59:37 EDT ---

REVIEW: https://review.gluster.org/18437 (glusterd : introduce timer in mgmt_v3_lock) posted (#10) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-10-17 11:44:54 EDT ---

COMMIT: https://review.gluster.org/18437 committed in master by Atin Mukherjee (amukherj)

------

commit 614904fa7a31bf6f69074238b7e710a20e05e1bb
Author: Gaurav Yadav <gyadav>
Date:   Thu Oct 5 23:44:46 2017 +0530

    glusterd : introduce timer in mgmt_v3_lock

    Problem:
    In a multinode environment, if two of the op-sm transactions
    are initiated on one of the receiver nodes at the same time,
    there might be a possibility that glusterd may end up in
    stale lock.
    Solution:
    During mgmt_v3_lock a registration is made to gf_timer_call_after
    which releases the lock after a certain period of time

    Change-Id: I16cc2e5186a2e8a5e35eca2468b031811e093843
    BUG: 1499004
    Signed-off-by: Gaurav Yadav <gyadav>

--- Additional comment from Shyamsundar on 2017-12-08 12:42:08 EST ---

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/
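[Editorial note] For illustration, here is a minimal, self-contained C sketch of the technique the commit message above describes: arming an expiry timer when the management lock is taken, so that a transaction which never unlocks cannot hold the lock forever (the stale-lock symptom reported in this bug). This is not the committed glusterd code - the real patch registers the expiry with libglusterfs' gf_timer_call_after() rather than a watchdog thread - and the names mgmt_lock_t, mgmt_lock_acquire(), mgmt_lock_release() and the 180-second timeout are all assumptions made for the example.

/* Hypothetical sketch of a self-expiring management lock (illustrative
 * only; the actual glusterd fix uses gf_timer_call_after()). */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define LOCK_TIMEOUT_SECS 180 /* assumed value; the real default may differ */

typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t cond;      /* wakes the watchdog on a normal unlock */
    bool held;
    unsigned long generation; /* guards against a stale watchdog firing */
    char owner[64];           /* uuid of the node holding the lock */
} mgmt_lock_t;

static mgmt_lock_t lk = {PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                         false, 0, ""};

/* Watchdog thread: force-releases the lock if the owner never unlocks. */
static void *expiry_timer(void *arg)
{
    unsigned long my_gen = (unsigned long)(uintptr_t)arg;
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += LOCK_TIMEOUT_SECS;

    pthread_mutex_lock(&lk.mtx);
    while (lk.held && lk.generation == my_gen) {
        /* non-zero return (ETIMEDOUT) => owner sat on the lock too long */
        if (pthread_cond_timedwait(&lk.cond, &lk.mtx, &deadline) != 0) {
            if (lk.held && lk.generation == my_gen) {
                fprintf(stderr, "lock held by %s expired; releasing\n",
                        lk.owner);
                lk.held = false; /* drop the stale lock */
            }
            break;
        }
    }
    pthread_mutex_unlock(&lk.mtx);
    return NULL;
}

/* Returns false when another transaction already holds the lock -- the
 * condition the CLI reports as "Another transaction is in progress". */
bool mgmt_lock_acquire(const char *owner_uuid)
{
    bool ok = false;
    pthread_mutex_lock(&lk.mtx);
    if (!lk.held) {
        lk.held = true;
        lk.generation++;
        snprintf(lk.owner, sizeof(lk.owner), "%s", owner_uuid);
        pthread_t tid; /* arm the expiry timer at lock time */
        pthread_create(&tid, NULL, expiry_timer,
                       (void *)(uintptr_t)lk.generation);
        pthread_detach(tid);
        ok = true;
    }
    pthread_mutex_unlock(&lk.mtx);
    return ok;
}

/* Normal unlock path; waking the watchdog effectively cancels it. */
void mgmt_lock_release(void)
{
    pthread_mutex_lock(&lk.mtx);
    lk.held = false;
    pthread_cond_broadcast(&lk.cond);
    pthread_mutex_unlock(&lk.mtx);
}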
REVIEW: https://review.gluster.org/19730 (glusterd : introduce timer in mgmt_v3_lock) posted (#1) for review on release-3.10 by Atin Mukherjee
COMMIT: https://review.gluster.org/19730 committed in release-3.10 by "Shyamsundar Ranganathan" <srangana> with a commit message:

    glusterd : introduce timer in mgmt_v3_lock

    Problem:
    In a multinode environment, if two of the op-sm transactions
    are initiated on one of the receiver nodes at the same time,
    there might be a possibility that glusterd may end up in
    stale lock.

    Solution:
    During mgmt_v3_lock a registration is made to gf_timer_call_after
    which releases the lock after a certain period of time

    >Change-Id: I16cc2e5186a2e8a5e35eca2468b031811e093843
    >BUG: 1499004
    >Signed-off-by: Gaurav Yadav <gyadav>

    Change-Id: I16cc2e5186a2e8a5e35eca2468b031811e093843
    BUG: 1557304
    Signed-off-by: Gaurav Yadav <gyadav>
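[Editorial note] To make the failure mode in the Problem paragraph concrete, here is a small driver for the hypothetical lock sketch posted under the master commit above (append it to the same file). Node A takes the lock and wedges without unlocking; node B is refused, which is the '...another transaction is in progress...' symptom from this bug; once the expiry timer fires, node B can proceed. Everything here is illustrative, not glusterd code, and LOCK_TIMEOUT_SECS would be shortened for a practical demo run.

/* Demo driver for the hypothetical sketch above (illustrative only). */
#include <unistd.h>

int main(void)
{
    if (mgmt_lock_acquire("uuid-node-A"))
        puts("node A: lock acquired, then wedges without unlocking");

    if (!mgmt_lock_acquire("uuid-node-B"))
        puts("node B: Another transaction is in progress. "
             "Please try again after sometime.");

    /* wait for the watchdog to force-release the stale lock */
    sleep(LOCK_TIMEOUT_SECS + 1);

    if (mgmt_lock_acquire("uuid-node-B"))
        puts("node B: lock acquired after the stale lock expired");
    mgmt_lock_release();
    return 0;
}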
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.12, please open a new bug report.

glusterfs-3.10.12 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-April/000095.html
[2] https://www.gluster.org/pipermail/gluster-users/