Description of problem:
======================
A few snapshot creation failures are seen with "quorum not met" and "brick ops failed" messages when multiple file/directory creations are in progress from FUSE and NFS mounts.

Version-Release number of selected component (if applicable):
============================================================
glusterfs-3.6.0.11-1.el6rhs.x86_64

How reproducible:
================
1/1

Steps to Reproduce:
==================
1. Set up a cluster of 4 servers (server1, server2, server3, server4).
2. Create four volumes from these servers (vol0, vol1, vol2, vol3).
3. Mount the volumes on a client (FUSE and NFS mounts).
4. Create directories named f and n from each FUSE mount of the volumes.
5. cd into f on all FUSE mounts of the volumes.
6. cd into n on all NFS mounts of the volumes.
7. Start heavy IO from the FUSE (f) mount and NFS (n) mount of every volume:

   for i in {1..50} ; do cp -rvf /etc etc.$i ; done

8. While IO is in progress, create snapshots on all volumes from different nodes:

   for i in {1..256} ; do gluster snapshot create snap_vol0_$i vol0 ; done
   for i in {1..256} ; do gluster snapshot create snap_vol1_$i vol1 ; done
   for i in {1..256} ; do gluster snapshot create snap_vol2_$i vol2 ; done
   for i in {1..256} ; do gluster snapshot create snap_vol3_$i vol3 ; done

Initially a few snapshots were not created because the snapshot create crossed the 2-minute CLI timeout.

Then one snapshot creation failed with a "quorum is not met" error message. On snapshot13:

snapshot create: success: Snap snap_vol0_41 created successfully
snapshot create: failed: quorum is not met
Snapshot command failed

Checked gluster volume info.

Then snapshot creation failed with a "brick ops failed" error message:

snapshot create: success: Snap snap_vol0_40 created successfully
snapshot create: success: Snap snap_vol0_41 created successfully
snapshot create: failed: quorum is not met
Snapshot command failed
snapshot create: failed: Another transaction is in progress. Please try again after sometime.
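The step-8 loops above can be driven from a small helper that only prints the snapshot-create commands, so each node can pipe its own volume's subset to a shell. This is a minimal sketch, not part of the gluster CLI: the gen_snap_cmds function is hypothetical, and volume names/counts are taken from the steps above.

```shell
#!/bin/sh
# Hedged sketch of step 8: emit the "gluster snapshot create" commands for
# one volume without executing them. gen_snap_cmds is a hypothetical helper.
gen_snap_cmds() {
    vol=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        echo "gluster snapshot create snap_${vol}_${i} ${vol}"
        i=$((i + 1))
    done
}

# On each node, run the loop for that node's volume, e.g.:
#   gen_snap_cmds vol0 256 | sh
gen_snap_cmds vol0 3
```

Printing the commands first also makes it easy to spot name collisions (two loops reusing the same snap prefix) before they hit the cluster.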
Snapshot command failed
snapshot create: failed: Brick ops failed on snapshot14.lab.eng.blr.redhat.com. Please check log file for details.
Brick ops failed on snapshot16.lab.eng.blr.redhat.com. Please check log file for details.
Brick ops failed on snapshot15.lab.eng.blr.redhat.com. Please check log file for details.
Snapshot command failed

There are also many brick disconnect messages in the log:

[2014-06-03 09:57:17.220019] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
[2014-06-03 09:57:18.009250] I [MSGID: 106005] [glusterd-handler.c:4126:__glusterd_brick_rpc_notify] 0-management: Brick snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/550f650254c84564b8546a9905644493/brick1/b3 has disconnected from glusterd.

-------------Part of log messages----------------------------
snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/d5907135f6524917bcabeb4d69d9ea33/brick1/b0 has disconnected from glusterd.
[2014-06-03 09:55:28.962797] W [glusterd-utils.c:1558:glusterd_snap_volinfo_find] 0-management: Snap volume 7bb6a9a005814aa5868a4322b586414b.snapshot13.lab.eng.blr.redhat.com.var-run-gluster-snaps-7bb6a9a005814aa5868a4322b586414b-brick1-b2 not found
[2014-06-03 09:55:28.963226] W [glusterd-utils.c:1558:glusterd_snap_volinfo_find] 0-management: Snap volume d5907135f6524917bcabeb4d69d9ea33.snapshot13.lab.eng.blr.redhat.com.var-run-gluster-snaps-d5907135f6524917bcabeb4d69d9ea33-brick1-b0 not found
[2014-06-03 09:55:29.114718] E [glusterd-utils.c:12489:glusterd_volume_quorum_check] 0-management: quorum is not met
[2014-06-03 09:55:29.120722] W [glusterd-utils.c:12715:glusterd_snap_quorum_check_for_create] 0-management: volume d5907135f6524917bcabeb4d69d9ea33 is not in quorum
[2014-06-03 09:55:29.120749] W [glusterd-utils.c:12754:glusterd_snap_quorum_check] 0-management: Quorum check failed during snapshot create command
[2014-06-03 09:55:29.120766] W [glusterd-mgmt.c:1928:glusterd_mgmt_v3_initiate_snap_phases] 0-management: quorum check failed
[2014-06-03 09:55:29.121124] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.11/xlator/mgmt/glusterd.so(glusterd_brick_disconnect+0x38) [0x7f9564e2f298] (-->/usr/lib64/glusterfs/3.6.0.11/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7f9564e2f155] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3564e0d633]))) 0-rpc_transport: invalid argument: this
-------------------------------------------------------------

Actual results:

Expected results:

Additional info:
sosreports : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/1104191/
Version: glusterfs 3.6.0.22

Went through the logs and found that the barrier timed out because the operation took more than 2 minutes while parallel snapshot creation was in progress. Modifying this bug to track the issue where a snapshot create might take more than 2 minutes when parallel snapshot creates and heavy IO are in progress at the same time. Tried the same case on physical machines without any failure. This needs to be documented.
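Since the failure mode tracked here is a snapshot create crossing the 2-minute CLI timeout, one way to gather evidence for the documentation is to time each create and flag those near the limit. A minimal sketch, assuming the default 120-second timeout discussed above; timed_snap_create and near_timeout are hypothetical helpers, not gluster commands:

```shell
#!/bin/sh
# Hedged sketch: time each "gluster snapshot create" and warn when the
# elapsed time reaches the default 2-minute (120 s) CLI timeout.
CLI_TIMEOUT=120

near_timeout() {
    # succeeds (exit 0) when elapsed seconds meet or exceed the timeout
    [ "$1" -ge "$CLI_TIMEOUT" ]
}

timed_snap_create() {
    vol=$1
    snap=$2
    start=$(date +%s)
    gluster snapshot create "$snap" "$vol"
    elapsed=$(( $(date +%s) - start ))
    echo "$snap: ${elapsed}s"
    if near_timeout "$elapsed"; then
        echo "WARNING: $snap reached the ${CLI_TIMEOUT}s CLI timeout window"
    fi
}

# Example (against a real cluster):
#   timed_snap_create vol0 snap_vol0_1
```

Running this while parallel snapshot creates and heavy IO are in progress would show how close individual creates come to the timeout on a given setup.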
Please review and sign off on the edited doc text.
There is no end to increasing the CLI timeout: as the number of nodes increases, the time taken grows exponentially. The current Gluster architecture does not support implementing this feature, so this feature request is deferred until GlusterD 2.0.