Bug 1122064 - [SNAPSHOT]: activate and deactivate doesn't do a handshake when a glusterd comes back
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: snapshot
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Mohammed Rafi KC
QA Contact: Rahul Hinduja
URL:
Whiteboard: SNAPSHOT
Depends On:
Blocks: 1087818 1122377 1202842 1219744 1223636
Reported: 2014-07-22 13:09 UTC by Rahul Hinduja
Modified: 2016-09-17 12:59 UTC

Fixed In Version: glusterfs-3.7.0-3.el6rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1122377
Environment:
Last Closed: 2015-07-29 04:34:27 UTC


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:1495 normal SHIPPED_LIVE Important: Red Hat Gluster Storage 3.1 update 2015-07-29 08:26:26 UTC

Description Rahul Hinduja 2014-07-22 13:09:57 UTC
Description of problem:
=======================

If glusterd is down on one node and a snapshot is activated or deactivated from another node, the operation succeeds. But when the node comes back, the new status of the snapshot is not propagated to it; the status remains as it was before the node went down.

For example:
============

>> Initial status of the snapshot on all the nodes: Started

[root@inception ~]# gluster snapshot info RS2 | grep "Status"
        Status                    : Started
[root@inception ~]# 

[root@rhs-arch-srv2 ~]# gluster snapshot info RS2 | grep "Status"
        Status                    : Started
[root@rhs-arch-srv2 ~]# 

>> Stop glusterd on one of the nodes:

[root@rhs-arch-srv2 ~]# service glusterd status
glusterd is stopped
[root@rhs-arch-srv2 ~]# 

>> Deactivate the snapshot from one of the other nodes:

[root@inception ~]# gluster snapshot deactivate RS2
Deactivating snap will make its data inaccessible. Do you want to continue? (y/n) y
Snapshot deactivate: RS2: Snap deactivated successfully
[root@inception ~]# 

>> Status on all the machines that are UP is now Stopped (deactivated):

[root@inception ~]# gluster snapshot info RS2 | grep "Status"
        Status                    : Stopped
[root@inception ~]# 

>> Bring glusterd back up on the node where it was stopped:

[root@rhs-arch-srv2 ~]# service glusterd status
glusterd (pid  20450) is running...
[root@rhs-arch-srv2 ~]# 

>> Check the status of the snapshot on all the nodes. It is deactivated on every node except the one that was down and came back:

[root@inception ~]# gluster snapshot info RS2 | grep "Status"
        Status                    : Stopped
[root@inception ~]# 


[root@rhs-arch-srv2 ~]# gluster snapshot info RS2 | grep "Status"
        Status                    : Started
[root@rhs-arch-srv2 ~]# 
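
The mismatch shown above can also be detected mechanically. The following is a minimal Python sketch that takes the text of `gluster snapshot info <snap>` captured separately on each node and groups the nodes by the Status they report. The function names and the way the output is collected are assumptions made for illustration; they are not part of gluster.

```python
import re

# Matches lines such as "        Status                    : Started".
STATUS_RE = re.compile(r"^\s*Status\s*:\s*(\S+)\s*$", re.MULTILINE)


def snapshot_statuses(info_text):
    """Return the Status values found in one node's `gluster snapshot info` output."""
    return STATUS_RE.findall(info_text)


def group_nodes_by_status(info_by_node):
    """Group node names by the tuple of statuses each node reports."""
    groups = {}
    for node, text in info_by_node.items():
        groups.setdefault(tuple(snapshot_statuses(text)), []).append(node)
    return groups


if __name__ == "__main__":
    # Output lines copied from the transcript above, one entry per node.
    captured = {
        "inception": "        Status                    : Stopped\n",
        "rhs-arch-srv2": "        Status                    : Started\n",
    }
    groups = group_nodes_by_status(captured)
    if len(groups) > 1:
        print("snapshot state differs between nodes:", groups)
```

More than one group in the result means that the nodes disagree about the snapshot state, which is the symptom reported here.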



Version-Release number of selected component (if applicable):
==============================================================

glusterfs-3.6.0.25-1.el6rhs.x86_64


How reproducible:
=================
1/1


Steps to Reproduce:
===================
1. Create a volume in a multi-node cluster
2. Create a snapshot
3. Bring down one of the nodes in the cluster
4. Deactivate the snapshot; this should succeed
5. Bring the node back UP
6. Check the status of the snapshot

Actual results:
===============

The snapshot is deactivated on all the nodes except the node that was brought back UP.


Expected results:
=================
Once the node is brought back online, a handshake should be performed so that the node picks up the correct status.


Additional info:
=================

The only way to get the correct status is to activate the snapshot again using force and then deactivate it when all nodes are UP.

Comment 2 Vivek Agarwal 2014-07-24 07:05:51 UTC
Based on discussion, removing the blocker flag from this

Comment 3 Shalaka 2014-09-21 04:23:28 UTC
Please review and sign-off edited doc text.

Comment 4 senaik 2014-11-19 07:52:34 UTC
Version : glusterfs 3.6.0.33

With the latest change, snapshots are deactivated by default and must be explicitly activated before use, so fixing this bug takes higher priority.

Comment 5 Rahul Hinduja 2014-12-16 07:26:21 UTC
Scenario for comment 4

1. Create 4 node cluster
2. Create 6*2 volume
3. Start the volume
4. Create a snapshot of a volume (snap1)
5. Kill glusterd on node2
6. Activate the snapshot snap1
7. Activating the snapshot should be successful and should bring 9 brick processes online on node1, node3 and node4
8. Bring back glusterd on node2
9. Once glusterd comes back on node2, it doesn't start the snapshot brick processes on node2

Network fluctuations and glusterd going down are valid use cases, and activating or deactivating a snapshot during that period leaves snapshots in an inconsistent state. The chances of hitting this are now very high.

One way of preventing this for this release is to disallow activate/deactivate while a node or its glusterd is down, unless the user explicitly issues activate/deactivate with force.
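
The guard proposed above could look roughly like the following Python sketch. It is only an illustration of the policy, not glusterd code: the function name and the `peer_connected` mapping (peer name to whether its glusterd is reachable) are hypothetical.

```python
def may_change_snapshot_state(peer_connected, force=False):
    """Decide whether a snapshot activate/deactivate should proceed.

    peer_connected: dict of peer name -> True if its glusterd is reachable
                    (hypothetical input, gathered by the caller).
    Returns (allowed, down_peers).
    """
    down_peers = sorted(name for name, up in peer_connected.items() if not up)
    # Proceed when every peer is up, or when the user explicitly forces it.
    allowed = force or not down_peers
    return allowed, down_peers
```

With node2 down, `may_change_snapshot_state({"node1": True, "node2": False})` refuses the operation and names node2, while passing `force=True` lets it through.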

Comment 6 Mohammed Rafi KC 2015-03-26 10:05:14 UTC
RCA:
During the glusterd handshake, we are not checking the version of the snaps. Whenever a change is made to a snap, its version is incremented. So during the handshake we have to compare the version of the peer's snap with that of the local snap. If the version of the snap details on the local host is lower than the peer's, the data on the local host must be updated.


upstream patch : http://review.gluster.org/#/c/9664/
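
The version check described in the RCA can be sketched as follows. This is a simplified Python model for illustration only and is not the code from the upstream patch: the `SnapInfo` type, its fields and the `reconcile` function are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SnapInfo:
    name: str
    version: int  # assumed to be incremented on every change to the snap
    status: str   # "Started" or "Stopped"


def reconcile(local, peer):
    """Return the snap data the local node should keep after a handshake."""
    if local.name != peer.name:
        raise ValueError("cannot compare different snapshots")
    # Local data is stale only when its version is lower than the peer's.
    if local.version < peer.version:
        return replace(local, version=peer.version, status=peer.status)
    return local
```

In the scenario from the description, the node that was down would hold a lower version with status Started, the peer a higher version with status Stopped, and `reconcile` would return the peer's version and status.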

Comment 8 senaik 2015-06-23 09:49:04 UTC
Version :glusterfs-3.7.1-4.el6rhs.x86_64
========
Create a snapshot. It is deactivated by default 
Stop glusterd on node2
Activate the snapshot from Node1 - successful
Bring back glusterd on Node2
Check gluster snapshot info from Node2 - Snapshot status shows 'Started'

Bring down glusterd on Node4 while deactivating an activated snapshot, and check on Node4 when glusterd comes back up: gluster snapshot info shows Status 'Stopped' and gluster snapshot status shows that no bricks are running.

Above is as expected.

However, when a node itself is brought down and the snapshot is activated in the meantime, snapshot info on that node still shows 'Stopped' after it comes back, and snapshot status shows that its bricks are not running.

Snapshot info from other nodes :
===============================
gluster snapshot info Snap2_GMT-2015.06.23-09.37.26
Snapshot                  : Snap2_GMT-2015.06.23-09.37.26
Snap UUID                 : 5961a313-62ea-41d0-8cad-0a8a0fafe766
Created                   : 2015-06-23 09:37:26
Snap Volumes:

	Snap Volume Name          : 3ee8f93e484540dcae8d55a64702e961
	Origin Volume name        : vol0
	Snaps taken for vol0      : 2
	Snaps available for vol0  : 1
	Status                    : Started


Node2 (which was rebooted)
==========================
 gluster snapshot info Snap2_GMT-2015.06.23-09.37.26
Snapshot                  : Snap2_GMT-2015.06.23-09.37.26
Snap UUID                 : 5961a313-62ea-41d0-8cad-0a8a0fafe766
Created                   : 2015-06-23 09:37:26
Snap Volumes:

	Snap Volume Name          : 3ee8f93e484540dcae8d55a64702e961
	Origin Volume name        : vol0
	Snaps taken for vol0      : 2
	Snaps available for vol0  : 1
	Status                    : Stopped

[root@rhs-arch-srv2 ~]# gluster snapshot status Snap2_GMT-2015.06.23-09.37.26

Snap Name : Snap2_GMT-2015.06.23-09.37.26
Snap UUID : 5961a313-62ea-41d0-8cad-0a8a0fafe766

	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick1/b1
	Volume Group      :   RHS_vg1
	Brick Running     :   Yes
	Brick PID         :   7536
	Data Percentage   :   0.05
	LV Size           :   1.80t


	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick2/b1
	Volume Group      :   RHS_vg1
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.13
	LV Size           :   29.66g


	Brick Path        :   rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick3/b1
	Volume Group      :   RHS_vg1
	Brick Running     :   Yes
	Brick PID         :   14376
	Data Percentage   :   0.13
	LV Size           :   29.66g


	Brick Path        :   rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick4/b1
	Volume Group      :   RHS_vg1
	Brick Running     :   Yes
	Brick PID         :   7975
	Data Percentage   :   0.13
	LV Size           :   29.66g


	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick5/b2
	Volume Group      :   RHS_vg2
	Brick Running     :   Yes
	Brick PID         :   7554
	Data Percentage   :   0.05
	LV Size           :   1.80t


	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick6/b2
	Volume Group      :   RHS_vg2
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.05
	LV Size           :   1.80t


	Brick Path        :   rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick7/b2
	Volume Group      :   RHS_vg2
	Brick Running     :   Yes
	Brick PID         :   14394
	Data Percentage   :   0.05
	LV Size           :   1.80t


	Brick Path        :   rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick8/b2
	Volume Group      :   RHS_vg2
	Brick Running     :   Yes
	Brick PID         :   7993
	Data Percentage   :   0.03
	LV Size           :   7.26t


	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick9/b3
	Volume Group      :   RHS_vg3
	Brick Running     :   Yes
	Brick PID         :   7572
	Data Percentage   :   0.05
	LV Size           :   1.80t


	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick10/b3
	Volume Group      :   RHS_vg3
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.03
	LV Size           :   7.26t


	Brick Path        :   rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick11/b3
	Volume Group      :   RHS_vg3
	Brick Running     :   Yes
	Brick PID         :   14412
	Data Percentage   :   0.03
	LV Size           :   7.26t


	Brick Path        :   rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick12/b4
	Volume Group      :   RHS_vg4
	Brick Running     :   Yes
	Brick PID         :   8011
	Data Percentage   :   0.03
	LV Size           :   7.26t


	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick13/b5
	Volume Group      :   RHS_vg5
	Brick Running     :   Yes
	Brick PID         :   7590
	Data Percentage   :   0.05
	LV Size           :   1.80t


	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick14/b5
	Volume Group      :   RHS_vg5
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.03
	LV Size           :   7.26t


	Brick Path        :   rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick15/b5
	Volume Group      :   RHS_vg5
	Brick Running     :   Yes
	Brick PID         :   14430
	Data Percentage   :   0.03
	LV Size           :   7.26t


	Brick Path        :   rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick16/b5
	Volume Group      :   RHS_vg5
	Brick Running     :   Yes
	Brick PID         :   8029
	Data Percentage   :   0.04
	LV Size           :   5.44t


	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick17/b6
	Volume Group      :   RHS_vg6
	Brick Running     :   Yes
	Brick PID         :   7608
	Data Percentage   :   0.05
	LV Size           :   1.80t


	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick18/b6
	Volume Group      :   RHS_vg6
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.04
	LV Size           :   5.44t


The above case fails in a node-down scenario. Moving the bug back to 'Assigned'.
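
With this many bricks, the ones that are down are easier to spot with a small filter. Below is a minimal Python sketch that reads `gluster snapshot status` output in the layout shown above and returns the paths of the bricks whose "Brick Running" field is not "Yes". The function name is an assumption, and the parsing relies only on the field names visible in this output.

```python
def bricks_not_running(status_text):
    """Return brick paths whose 'Brick Running' field is not 'Yes'."""
    stopped = []
    current_path = None
    for line in status_text.splitlines():
        # Split on the first colon only; brick paths contain colons themselves.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Brick Path":
            current_path = value
        elif key == "Brick Running" and current_path is not None:
            if value != "Yes":
                stopped.append(current_path)
            current_path = None
    return stopped
```

Applied to the output above, it would be expected to list only the bricks hosted on rhs-arch-srv2.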

Comment 9 Mohammed Rafi KC 2015-07-01 08:23:42 UTC
I tested the above case with a 2*2 volume and it is working fine. There is a short delay before the bricks start after the nodes come back online. If you check the snapshot status during that window, the bricks will show as offline.



Tested using the latest available downstream build:

glusterfs-debuginfo-3.7.1-6.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-6.el6rhs.x86_64
glusterfs-server-3.7.1-6.el6rhs.x86_64
glusterfs-rdma-3.7.1-6.el6rhs.x86_64
glusterfs-3.7.1-6.el6rhs.x86_64
glusterfs-api-3.7.1-6.el6rhs.x86_64
glusterfs-cli-3.7.1-6.el6rhs.x86_64
glusterfs-devel-3.7.1-6.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-6.el6rhs.x86_64
glusterfs-libs-3.7.1-6.el6rhs.x86_64
glusterfs-fuse-3.7.1-6.el6rhs.x86_64
glusterfs-api-devel-3.7.1-6.el6rhs.x86_64
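
Because of the short delay mentioned in this comment, a verification script should poll instead of checking the status once. A minimal Python sketch of such a wait loop follows. The function name, the 120 second timeout and the 5 second interval are assumptions, not measured values, and `check` stands for whatever test the caller uses to decide that the snapshot bricks are running.

```python
import time


def wait_until(check, timeout=120.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or the timeout (seconds) expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so that the loop can be exercised without real waiting.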

Comment 10 senaik 2015-07-01 13:21:54 UTC
Version : glusterfs-3.7.1-6.el6rhs.x86_64
=======
Retried the scenarios mentioned in Comment 8 and the Description. Snapshot status shows Started and all bricks are running after the node reboot.

Waited for a while after the node rebooted before checking the status.

Marking the bug as Verified.

Comment 12 errata-xmlrpc 2015-07-29 04:34:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

