Description of problem:
Creating a snapshot failed with the error "unbarrier brick opfailed with the error quorum is not met".

Version-Release number of selected component (if applicable):
glusterfs 3.6.0.10

How reproducible:

Steps to Reproduce:
1. Create a 3 x 3 distributed-replicate volume.
2. Create a snapshot.

Actual results:
[2014-06-03 11:48:57.070956] W [glusterd-mgmt.c:1928:glusterd_mgmt_v3_initiate_snap_phases] 0-management: quorum check failed
[2014-06-03 11:48:57.071275] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.10/xlator/mgmt/glusterd.so(glusterd_brick_disconnect+0x38) [0x7fe8105f8298] (-->/usr/lib64/glusterfs/3.6.0.10/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7fe8105f8155] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3f0720d633]))) 0-rpc_transport: invalid argument: this
[2014-06-03 11:48:57.073964] E [glusterd-utils.c:1939:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/ec27630a5d765ac60f0815d2373d69ee.socket error: Permission denied
[2014-06-03 11:48:57.074135] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.10/xlator/mgmt/glusterd.so(glusterd_brick_disconnect+0x38) [0x7fe8105f8298] (-->/usr/lib64/glusterfs/3.6.0.10/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7fe8105f8155] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x3f0720d633]))) 0-rpc_transport: invalid argument: this
[2014-06-03 11:48:57.074276] E [glusterd-utils.c:1939:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/60cbf17686c7e8586babf35271de0dfe.socket error: Permission denied
[2014-06-03 11:48:57.074339] I [glusterd-utils.c:1608:glusterd_service_stop] 0-management: brick already stopped
[2014-06-03 11:48:58.513807] I [glusterd-snapshot.c:1972:glusterd_lvm_snapshot_remove] 0-management: snapshot was pending. lvm not present for brick 10.70.36.231:/var/run/gluster/snaps/5feab1d45a1c4f928bb0624425800fd3/brick9/napbrick1/d3r33 of the snap snap22.
[2014-06-03 11:48:58.515714] E [glusterd-snapshot.c:5931:glusterd_snapshot_create_postvalidate] 0-management: unable to find snap snap22
[2014-06-03 11:48:58.516128] W [glusterd-utils.c:1499:glusterd_snap_volinfo_find_by_volume_id] 0-management: Snap volume not found
[2014-06-03 11:48:58.516313] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/5feab1d45a1c4f928bb0624425800fd3/brick1/d1r12 on port 49163
[2014-06-03 11:48:58.518011] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/5feab1d45a1c4f928bb0624425800fd3/brick5/d2r22 on port 49162
[2014-06-03 11:48:59.792796] E [glusterd-mgmt.c:1962:glusterd_mgmt_v3_initiate_snap_phases] 0-management: unbarrier brick opfailed with the error quorum is not met

Expected results:
Snapshot should be created.

Additional info:
[root@rhsauto001 ~]# gluster v info

Volume Name: snapvol
Type: Distributed-Replicate
Volume ID: 1a3ea597-1b1f-477d-aed7-b43da2fb9304
Status: Started
Snap Volume: no
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.70.36.231:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick1/d1r12
Brick2: 10.70.36.233:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick2/d1r22
Brick3: 10.70.36.236:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick3/d1r33
Brick4: 10.70.36.237:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick4/d2r12
Brick5: 10.70.36.231:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick5/d2r22
Brick6: 10.70.36.233:/var/run/gluster/snaps/ed005655010c4a33a55e276eb2be3d71/brick6/d2r33
Brick7: 10.70.36.236:/snapbrick1/d3r12
Brick8: 10.70.36.237:/snapbrick1/d3r22
Brick9: 10.70.36.231:/snapbrick1/d3r33
Options Reconfigured:
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
cluster.data-self-heal: off
cluster.self-heal-daemon: off
features.barrier: disable
performance.open-behind: off
performance.quick-read: off
performance.io-cache: off
performance.read-ahead: off
performance.write-behind: off
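The steps to reproduce can be sketched as the following CLI sequence. This is an illustration, not the reporter's exact commands: the hostnames (node1..node4) and brick paths are placeholders, and "force" reflects the down brick noted in the later analysis.

```shell
# 1. Create a 3 x 3 distributed-replicate volume (9 bricks, replica 3).
gluster volume create snapvol replica 3 \
    node1:/bricks/d1r1 node2:/bricks/d1r2 node3:/bricks/d1r3 \
    node4:/bricks/d2r1 node1:/bricks/d2r2 node2:/bricks/d2r3 \
    node3:/bricks/d3r1 node4:/bricks/d3r2 node1:/bricks/d3r3
gluster volume start snapvol

# 2. Create a snapshot; "force" is required when a brick is down.
gluster snapshot create snap22 snapvol force
```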
Created attachment 901973 [details] Log attached
Tried reproducing the bug with the latest upstream code, but couldn't reproduce it. The setup used:

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 025c4f1c-f67d-4f31-a0b5-d6c5f7aa0466
Status: Started
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: joeremote1:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick1/brick1
Brick2: joeremote1:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick2/brick2
Brick3: joeremote2:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick3/brick1
Brick4: joeremote2:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick4/brick2
Brick5: joeremote2:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick5/brick3
Brick6: joeremote1:/var/run/gluster/snaps/02c0997efc5f4f1f8b90cd7d75afddef/brick6/brick3
Brick7: joeremote1:/export4/tmp/brick4
Brick8: joeremote1:/export5/tmp/brick5
Brick9: joeremote2:/export4/tmp/brick4
Options Reconfigured:
features.barrier: disable

I will analyse the logs and update with the result.
Looking at the logs, these are the observations:

1) One of the volume's bricks, Brick9: 10.70.36.231:/snapbrick1/d3r33, was down when the snapshot was taken, so the snapshot create force command was used.

glusterd logs:
[2014-06-03 11:46:50.272440] I [MSGID: 106005] [glusterd-handler.c:4126:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.36.231:/snapbrick1/d3r33 has disconnected from glusterd.
[2014-06-03 11:48:31.266782] W [glusterd-snapshot.c:1630:glusterd_snapshot_create_prevalidate] 0-management: brick 10.70.36.231:/snapbrick1/d3r33 is not started

Brick log:
[2014-06-03 11:46:50.266276] W [glusterfsd.c:1182:cleanup_and_exit] (--> 0-: received signum (15), shutting down

2) The snapshot create commit on the local system was successful.

3) The snapshot create commit on the remote peer systems was unsuccessful.

glusterd log:
[2014-06-03 11:48:46.638882] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on 10.70.36.237. Please check log file for details.
[2014-06-03 11:48:54.338041] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on 10.70.36.236. Please check log file for details.
[2014-06-03 11:48:54.338154] E [glusterd-mgmt.c:1173:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
[2014-06-03 11:48:54.338194] E [glusterd-mgmt.c:1894:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Commit Op Failed

The reason why it failed could be determined if we had the glusterd/glusterfsd logs from the peer nodes, which are not attached to the bug.

4) Once the remote peer commit failed, the unbarrier brick-op is called, which does not fail.

5) After that the snap volume quorum is checked, and it fails.

6) Post-validate is called and the cleanup is done.

Anil, could you please provide the sos-reports from the other nodes, so that we can pinpoint why the snapshot create commit failed on those nodes.
Joseph, I don't have sos-reports from the other nodes.
1) Since the reason for the commit failure on the remote nodes is not known (due to the absence of logs), we cannot pinpoint the cause of the failure.

2) The log/CLI messaging can be improved here: there is no unbarriering failure, yet we still get the message "unbarrier brick opfailed with the error quorum is not met", which is misleading.
Version: glusterfs 3.6.0.20 built on Jun 19 2014
=======
I got the following error message while attaching a new node to the cluster while snapshot create was in progress:

snapshot create: success: Snap snap4 created successfully
snapshot create: failed: glusterds are not in quorum
Snapshot command failed
snapshot create: success: Snap snap6 created successfully

All glusterds were up and running on the nodes, but we still get the message that glusterd quorum is not met.

----------------Part of log---------------------
name:snapshot15.lab.eng.blr.redhat.com
[2014-06-23 06:03:31.887252] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7e97d0f0-8ae9-40eb-b822-952cc5a8dc46, host name:10.70.44.54
[2014-06-23 06:03:32.166226] W [glusterd-utils.c:12909:glusterd_snap_quorum_check_for_create] 0-management: glusterds are not in quorum
[2014-06-23 06:03:32.166352] W [glusterd-utils.c:13058:glusterd_snap_quorum_check] 0-management: Quorum checkfailed during snapshot create command
[2014-06-23 06:03:32.166374] W [glusterd-mgmt.c:1846:glusterd_mgmt_v3_initiate_snap_phases] 0-management: quorum check failed
[2014-06-23 06:03:32.166416] W [glusterd-snapshot.c:7012:glusterd_snapshot_postvalidate] 0-management: Snapshot create post-validation failed
[2014-06-23 06:03:32.166433] W [glusterd-mgmt.c:248:gd_mgmt_v3_post_validate_fn] 0-management: postvalidate operation failed
[2014-06-23 06:03:32.166451] E [glusterd-mgmt.c:1335:glusterd_mgmt_v3_post_validate] 0-management: Post Validation failed for operation Snapshot on local node
[2014-06-23 06:03:32.166467] E [glusterd-mgmt.c:1944:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Post Validation Failed
[2014-06-23 06:03:33.972792] I [glusterd-handshake.c:1014:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
Seema, could you please attach the SOS reports of all the nodes?
Sorry for removing the blocks; adding them again.
sosreports for comment 8 : ======================== http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/snapshots/1104478/
1) The issue in this bug is that the message "unbarrier brick opfailed with the error quorum is not met" is printed when actually:
a. The commit has failed.
b. The unbarrier has not failed.
c. The quorum check for the failed SNAP VOLUME is done, which is bound to fail. The quorum check for the SNAP VOLUME is not needed when the create commit has failed.

The fix for this bug is to not do a quorum check for a SNAP VOLUME whose commit has failed, and to print the correct message rather than a static message that is unrelated to the actual failure.

2) Bug 1085278 is not related to this issue, for the following reasons:
a. Investigation of the logs shows that the quorum check of the MAIN VOLUME fails just after the prevalidate (please note: pre-validate passed).
b. The scenario of that bug is different, as we have not yet entered the commit phase.
c. We do not have any "Brick Ops Failed" for unbarriering.

I do agree it is a bug and needs to be investigated separately, but it is not related to this one, and that bug should not block 1085278. Removing from the bug 1085278 blocks list.
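The intended fix can be sketched as a small decision helper. This is a minimal illustration of the control flow only, with hypothetical names; the real change lives in glusterd_mgmt_v3_initiate_snap_phases() in glusterd-mgmt.c and does not use these symbols.

```c
#include <assert.h>
#include <string.h>

/* Illustrative return codes for the snapshot phases. */
typedef enum { PHASE_OK = 0, PHASE_FAIL = -1 } phase_ret_t;

/* If the commit phase already failed, the snap volume was never fully
 * created, so running a quorum check on it is pointless: it is bound
 * to fail and only produces a misleading error. */
int
should_check_snap_quorum (phase_ret_t commit_ret)
{
        return (commit_ret == PHASE_OK);
}

/* Report the phase that actually failed instead of the static
 * "unbarrier brick opfailed with the error quorum is not met". */
const char *
snap_failure_message (phase_ret_t commit_ret, phase_ret_t unbarrier_ret)
{
        if (commit_ret != PHASE_OK)
                return "Commit failed on peers";
        if (unbarrier_ret != PHASE_OK)
                return "unbarrier brick op failed";
        return "quorum is not met";
}
```

With this flow, the scenario in this bug (commit failed, unbarrier succeeded) skips the snap-volume quorum check entirely and reports the commit failure, instead of the unrelated quorum message.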
Raised another bz 1112250 to track the issue mentioned in Comment 8
Fix submitted upstream: Anand Avati 2014-06-24 04:53:30 EDT REVIEW: http://review.gluster.org/8158 (glusterd/snapshot : Fixing Msging in glusterd_mgmt_v3_initiate_snap_phases) posted (#1) for review on master by Joseph Fernandes (josferna)
Also, in the logs I can see a brick path like /var/run/gluster/snaps/5feab1d45a1c4f928bb0624425800fd3/brick9/napbrick1/d3r33 (note "napbrick1" where the volume's brick is /snapbrick1/d3r33). See the description logs for more details.
Since I am no longer able to see the error "quorum check failed" with every snapshot failure, marking this bug verified on build glusterfs-3.7.1-9.el6rhs.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html