Description of problem:
Creating a snapshot failed with the error "unbarrier brick opfailed with the error quorum is not met".

Version-Release number of selected component (if applicable):
glusterfs 3.6.0.10

How reproducible:

Steps to Reproduce:
1. Create a 3 x 3 distributed-replicate volume.
2. Create a snapshot.

Actual results:
[2014-06-03 11:48:57.070956] W [glusterd-mgmt.c:1928:glusterd_mgmt_v3_initiate_snap_phases] 0-management: quorum check failed
[2014-06-03 11:48:57.071275] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.10/xlator/mgmt/glusterd.so(glusterd_brick_disconnect+0x38) [0x7fe8105f8298] (-->/usr/lib64/glusterfs/3.6.0.10/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7fe8105f8155] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3f0720d633]))) 0-rpc_transport: invalid argument: this
[2014-06-03 11:48:57.073964] E [glusterd-utils.c:1939:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/ec27630a5d765ac60f0815d2373d69ee.socket error: Permission denied
[2014-06-03 11:48:57.074135] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.10/xlator/mgmt/glusterd.so(glusterd_brick_disconnect+0x38) [0x7fe8105f8298] (-->/usr/lib64/glusterfs/3.6.0.10/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7fe8105f8155] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3f0720d633]))) 0-rpc_transport: invalid argument: this
[2014-06-03 11:48:57.074276] E [glusterd-utils.c:1939:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/60cbf17686c7e8586babf35271de0dfe.socket error: Permission denied
[2014-06-03 11:48:57.074339] I [glusterd-utils.c:1608:glusterd_service_stop] 0-management: brick already stopped
[2014-06-03 11:48:58.513807] I [glusterd-snapshot.c:1972:glusterd_lvm_snapshot_remove] 0-management: snapshot was pending. lvm not present for brick 10.70.36.231:/var/run/gluster/snaps/5feab1d45a1c4f928bb0624425800fd3/brick9/napbrick1/d3r33 of the snap snap22.
[2014-06-03 11:48:58.515714] E [glusterd-snapshot.c:5931:glusterd_snapshot_create_postvalidate] 0-management: unable to find snap snap22
[2014-06-03 11:48:58.516128] W [glusterd-utils.c:1499:glusterd_snap_volinfo_find_by_volume_id] 0-management: Snap volume not found
[2014-06-03 11:48:58.516313] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/5feab1d45a1c4f928bb0624425800fd3/brick1/d1r12 on port 49163
[2014-06-03 11:48:58.518011] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/5feab1d45a1c4f928bb0624425800fd3/brick5/d2r22 on port 49162
[2014-06-03 11:48:59.792796] E [glusterd-mgmt.c:1962:glusterd_mgmt_v3_initiate_snap_phases] 0-management: unbarrier brick opfailed with the error quorum is not met

Expected results:
Snapshot should be created.

Additional info:
[root@rhsauto001 ~]# gluster v info

Volume Name: snapvol
Type: Distributed-Replicate
Volume ID: 1a3ea597-1b1f-477d-aed7-b43da2fb9304
Status: Started
Snap Volume: no
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.70.36.231:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick1/d1r12
Brick2: 10.70.36.233:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick2/d1r22
Brick3: 10.70.36.236:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick3/d1r33
Brick4: 10.70.36.237:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick4/d2r12
Brick5: 10.70.36.231:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick5/d2r22
Brick6: 10.70.36.233:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick6/d2r33
Brick7: 10.70.36.236:/snapbrick1/d3r12
Brick8: 10.70.36.237:/snapbrick1/d3r22
Brick9: 10.70.36.231:/snapbrick1/d3r33
Options Reconfigured:
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
cluster.data-self-heal: off
cluster.self-heal-daemon: off
features.barrier: disable
performance.open-behind: off
performance.quick-read: off
performance.io-cache: off
performance.read-ahead: off
performance.write-behind: off
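The steps to reproduce can be sketched as the following CLI sequence. This is an illustration, not the reporter's exact commands: the hostnames (node1..node4) and brick paths are placeholders, and "force" reflects the down brick noted in the later analysis.

```shell
# 1. Create a 3 x 3 distributed-replicate volume (9 bricks, replica 3).
gluster volume create snapvol replica 3 \
    node1:/bricks/d1r1 node2:/bricks/d1r2 node3:/bricks/d1r3 \
    node4:/bricks/d2r1 node1:/bricks/d2r2 node2:/bricks/d2r3 \
    node3:/bricks/d3r1 node4:/bricks/d3r2 node1:/bricks/d3r3
gluster volume start snapvol

# 2. Create a snapshot; "force" is required when a brick is down.
gluster snapshot create snap22 snapvol force
```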
Created attachment 901973 [details] Log attached
Tried reproducing the bug with the latest upstream code, but couldn't reproduce it. The setup used:

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 025c4f1c-f67d-4f31-a0b5-d6c5f7aa0466
Status: Started
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: joeremote1:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick1/brick1
Brick2: joeremote1:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick2/brick2
Brick3: joeremote2:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick3/brick1
Brick4: joeremote2:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick4/brick2
Brick5: joeremote2:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick5/brick3
Brick6: joeremote1:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick6/brick3
Brick7: joeremote1:/export4/tmp/brick4
Brick8: joeremote1:/export5/tmp/brick5
Brick9: joeremote2:/export4/tmp/brick4
Options Reconfigured:
features.barrier: disable

I will analyse the logs and update with the result.
Looking at the logs, these are the observations:

1) One of the volume's bricks, Brick9: 10.70.36.231:/snapbrick1/d3r33, was down when the snapshot was taken, so the snapshot create force command was used.

glusterd logs:
[2014-06-03 11:46:50.272440] I [MSGID: 106005] [glusterd-handler.c:4126:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.36.231:/snapbrick1/d3r33 has disconnected from glusterd.
[2014-06-03 11:48:31.266782] W [glusterd-snapshot.c:1630:glusterd_snapshot_create_prevalidate] 0-management: brick 10.70.36.231:/snapbrick1/d3r33 is not started

Brick log:
[2014-06-03 11:46:50.266276] W [glusterfsd.c:1182:cleanup_and_exit] (--> 0-: received signum (15), shutting down

2) The snapshot create commit on the local system was successful.

3) The snapshot create commit on the remote peer systems was unsuccessful.

glusterd log:
[2014-06-03 11:48:46.638882] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on 10.70.36.237. Please check log file for details.
[2014-06-03 11:48:54.338041] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on 10.70.36.236. Please check log file for details.
[2014-06-03 11:48:54.338154] E [glusterd-mgmt.c:1173:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
[2014-06-03 11:48:54.338194] E [glusterd-mgmt.c:1894:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Commit Op Failed

The reason why it failed could be determined if we had the glusterd/glusterfsd logs from the peer nodes, which are not attached to the bug.

4) Once the remote peer commit failed, the unbarrier brick-op is called, which does not fail.

5) After that the snap volume quorum is checked, and it fails.

6) Post-validate is called and the cleanup is done.

Anil, could you please provide the sos-reports from the other nodes, so that we can pinpoint why the snapshot create commit failed on those nodes.
Joseph, I don't have sos-reports from the other nodes.
1) Since the reason for the commit failure on the remote nodes is not known (due to the absence of logs), we cannot pinpoint the cause of the failure.

2) The log/CLI messaging can be improved here: there is no unbarriering failure, yet we still get the message "unbarrier brick opfailed with the error quorum is not met", which is misleading.
Version: glusterfs 3.6.0.20 built on Jun 19 2014
=======
I got the following error message while attaching a new node to the cluster while snapshot create was in progress:

snapshot create: success: Snap snap4 created successfully
snapshot create: failed: glusterds are not in quorum
Snapshot command failed
snapshot create: success: Snap snap6 created successfully

All glusterds were up and running on the nodes, but we still get the message that glusterd quorum is not met.

----------------Part of log---------------------
name:snapshot15.lab.eng.blr.redhat.com
[2014-06-23 06:03:31.887252] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7e97d0f0-8ae9-40eb-b822-952cc5a8dc46, host name:10.70.44.54
[2014-06-23 06:03:32.166226] W [glusterd-utils.c:12909:glusterd_snap_quorum_check_for_create] 0-management: glusterds are not in quorum
[2014-06-23 06:03:32.166352] W [glusterd-utils.c:13058:glusterd_snap_quorum_check] 0-management: Quorum checkfailed during snapshot create command
[2014-06-23 06:03:32.166374] W [glusterd-mgmt.c:1846:glusterd_mgmt_v3_initiate_snap_phases] 0-management: quorum check failed
[2014-06-23 06:03:32.166416] W [glusterd-snapshot.c:7012:glusterd_snapshot_postvalidate] 0-management: Snapshot create post-validation failed
[2014-06-23 06:03:32.166433] W [glusterd-mgmt.c:248:gd_mgmt_v3_post_validate_fn] 0-management: postvalidate operation failed
[2014-06-23 06:03:32.166451] E [glusterd-mgmt.c:1335:glusterd_mgmt_v3_post_validate] 0-management: Post Validation failed for operation Snapshot on local node
[2014-06-23 06:03:32.166467] E [glusterd-mgmt.c:1944:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Post Validation Failed
[2014-06-23 06:03:33.972792] I [glusterd-handshake.c:1014:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
Seema, could you please attach the SOS reports of all the nodes?
Sorry for removing the blocks; adding them again.
sosreports for comment 8 : ======================== http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/snapshots/1104478/
1) The issue in this bug is that the message "unbarrier brick opfailed with the error quorum is not met" is printed when actually:
a. The commit has failed.
b. The unbarrier has not failed.
c. The quorum check for the failed SNAP VOLUME is done, which is bound to fail. The quorum check for the SNAP VOLUME is not needed when the create commit has failed.

The fix for this bug is to not do a quorum check for a SNAP VOLUME whose commit has failed, and to print the correct message rather than a static message that is unrelated to the actual failure.

2) Bug 1085278 is not related to this issue, for the following reasons:
a. Investigation of the logs shows that the quorum check of the MAIN VOLUME fails just after the prevalidate (please note: pre-validate passed).
b. The scenario of that bug is different, as we have not yet entered the commit phase.
c. We do not have any "Brick Ops Failed" for unbarriering.

I do agree it is a bug and needs to be investigated separately, but it is not related to this one, and that bug should not block 1085278. Removing from the bug 1085278 blocks list.
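The intended fix can be sketched as a small decision helper. This is a minimal illustration of the control flow only, with hypothetical names; the real change lives in glusterd_mgmt_v3_initiate_snap_phases() in glusterd-mgmt.c and does not use these symbols.

```c
#include <assert.h>
#include <string.h>

/* Illustrative return codes for the snapshot phases. */
typedef enum { PHASE_OK = 0, PHASE_FAIL = -1 } phase_ret_t;

/* If the commit phase already failed, the snap volume was never fully
 * created, so running a quorum check on it is pointless: it is bound
 * to fail and only produces a misleading error. */
int
should_check_snap_quorum (phase_ret_t commit_ret)
{
        return (commit_ret == PHASE_OK);
}

/* Report the phase that actually failed instead of the static
 * "unbarrier brick opfailed with the error quorum is not met". */
const char *
snap_failure_message (phase_ret_t commit_ret, phase_ret_t unbarrier_ret)
{
        if (commit_ret != PHASE_OK)
                return "Commit failed on peers";
        if (unbarrier_ret != PHASE_OK)
                return "unbarrier brick op failed";
        return "quorum is not met";
}
```

With this flow, the scenario in this bug (commit failed, unbarrier succeeded) skips the snap-volume quorum check entirely and reports the commit failure, instead of the unrelated quorum message.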
Raised another bz 1112250 to track the issue mentioned in Comment 8
Fix submitted upstream: Anand Avati 2014-06-24 04:53:30 EDT REVIEW: http://review.gluster.org/8158 (glusterd/snapshot : Fixing Msging in glusterd_mgmt_v3_initiate_snap_phases) posted (#1) for review on master by Joseph Fernandes (josferna)
Also, in the logs I can see a brick path like /var/run/gluster/snaps/5feab1d45a1c4f928bb0624425800fd3/brick9/napbrick1/d3r33 (note "napbrick1" where the volume's brick is /snapbrick1/d3r33). See the description logs for more details.
Since I am no longer able to see the error "quorum check failed" with every snapshot failure, marking this bug verified on build glusterfs-3.7.1-9.el6rhs.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html