REVIEW: http://review.gluster.org/9307 (USS : Kill snapd during glusterd restart if USS is disabled) posted (#1) for review on release-3.6 by Sachin Pandit (spandit)
REVIEW: http://review.gluster.org/9307 (USS : Kill snapd during glusterd restart if USS is disabled.) posted (#2) for review on release-3.6 by Sachin Pandit (spandit)
REVIEW: http://review.gluster.org/9307 (USS : Kill snapd during glusterd restart if USS is disabled) posted (#3) for review on release-3.6 by Sachin Pandit (spandit)
COMMIT: http://review.gluster.org/9307 committed in release-3.6 by Raghavendra Bhat (raghavendra)
------
commit 9f0589646b4932b33ac0a913b1a23d8f279faf2b
Author: Sachin Pandit <spandit>
Date:   Wed Nov 5 11:09:59 2014 +0530

    USS : Kill snapd during glusterd restart if USS is disabled

    Problem : When glusterd is down on one of the nodes and during that
    time if USS is disabled then snapd will still be running
    in the node where glusterd was down.

    Solution : during restart of glusterd check if USS is disabled,
    if so then issue a kill for snapd.

    NOTE : The test case which I wrote in my previous patchset
    is facing some spurious failures, hence I thought of removing
    that test case. I'll add the test case once the issue is resolved.

    Change-Id: I2870ebb4b257d863cdfc319e8485b19e932576e9
    BUG: 1175735
    Signed-off-by: Sachin Pandit <spandit>
    Reviewed-on: http://review.gluster.org/9062
    Reviewed-by: Rajesh Joseph <rjoseph>
    Reviewed-by: Avra Sengupta <asengupt>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Krishnan Parthasarathi <kparthas>
    Signed-off-by: Sachin Pandit <spandit>
    Reviewed-on: http://review.gluster.org/9307
    Reviewed-by: Raghavendra Bhat <raghavendra>
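The patch automates, on glusterd restart, what an administrator would otherwise have to do by hand on the affected node: check whether USS is disabled for the volume and, if so, kill the leftover snapd. Below is a minimal manual sketch of that check for unfixed builds, assuming a volume named vol3 and the snapd pidfile path visible in the ps output in the description further down; adjust both for your deployment.

#!/bin/sh
# Manual workaround sketch for an unfixed build: kill a stale snapd
# left behind by a glusterd restart while USS is disabled.
# Assumptions: volume name "vol3" and the pidfile path seen in the
# bug's ps output; neither is mandated by the fix itself.
VOL=vol3
PIDFILE=/var/lib/glusterd/vols/$VOL/run/$VOL-snapd.pid

# Only act when USS is actually disabled for the volume.
if gluster volume info "$VOL" | grep -q "features.uss: off"; then
    # If snapd is still alive, kill it via the pidfile glusterd passed to it.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        kill "$(cat "$PIDFILE")"
    fi
fi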
Description of problem:
=======================
When uss is enabled, it starts snapd on all the machines in the cluster. But in a scenario where the user disables uss while glusterd is down on some node, uss gets disabled and yet the snapd process stays alive on the machine where glusterd went down. That much is expected. The problem is that when glusterd comes back up, snapd is still alive even though uss is disabled.

For example:
============

Uss is disabled and no snapd process is running on any machine:
================================================================
[root@inception ~]# gluster v i vol3 | grep uss
features.uss: off
[root@inception ~]# ps -eaf | grep snapd
root      2299 26954  0 18:05 pts/0    00:00:00 grep snapd
[root@inception ~]#

Enable uss; the snapd process should run on all the machines:
================================================================
[root@inception ~]# gluster v set vol3 uss on
volume set: success
[root@inception ~]# gluster v i vol3 | grep uss
features.uss: on
[root@inception ~]#
[root@inception ~]# gluster v status vol3 | grep -i "snapshot daemon"
Snapshot Daemon on localhost                    49158   Y       2322
Snapshot Daemon on hostname1                    49157   Y       3868
Snapshot Daemon on hostname2                    49157   Y       3731
Snapshot Daemon on hostname3                    49157   Y       3265
[root@inception ~]#

Now disable USS and at the same time stop glusterd on multiple machines:
========================================================================
[root@inception ~]# gluster v set vol3 uss off
volume set: success
[root@inception ~]# gluster v status vol3 | grep -i "snapshot daemon"
[root@inception ~]# gluster v status vol3
Status of volume: vol3
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick hostname1:/rhs/brick4/b4                  49155   Y       32406
NFS Server on localhost                         2049    Y       2431
Self-heal Daemon on localhost                   N/A     Y       2202

Task Status of Volume vol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@inception ~]#

snapd is not running on the machine where glusterd is UP, but is still running on the machines where glusterd is down:
==========================================================================

Node1:
======
[root@inception ~]# ps -eaf | grep snapd
root      2501 26954  0 18:11 pts/0    00:00:00 grep snapd
[root@inception ~]#

Node2:
======
[root@rhs-arch-srv2 ~]# ps -eaf | grep snapd
root      3868     1  0 12:36 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/vol3 -p /var/lib/glusterd/vols/vol3/run/vol3-snapd.pid -l /var/log/glusterfs/vol3-snapd.log --brick-name snapd-vol3 -S /var/run/c01a04ffff6172926bfc0364bd457af3.socket --brick-port 49157 --xlator-option vol3-server.listen-port=49157
root      4163  5023  0 12:41 pts/0    00:00:00 grep snapd
[root@rhs-arch-srv2 ~]#

Node3:
======
[root@rhs-arch-srv3 ~]# ps -eaf | grep snapd
root      3731     1  0 12:35 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/vol3 -p /var/lib/glusterd/vols/vol3/run/vol3-snapd.pid -l /var/log/glusterfs/vol3-snapd.log --brick-name snapd-vol3 -S /var/run/79af174d6c9c86897e0ff72f002994f2.socket --brick-port 49157 --xlator-option vol3-server.listen-port=49157
root      4028  5029  0 12:40 pts/0    00:00:00 grep snapd
[root@rhs-arch-srv3 ~]#

Node4:
======
[root@rhs-arch-srv4 ~]# ps -eaf | grep snapd
root      3265     1  0 12:36 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/vol3 -p /var/lib/glusterd/vols/vol3/run/vol3-snapd.pid -l /var/log/glusterfs/vol3-snapd.log --brick-name snapd-vol3 -S /var/run/4bd0ff786ad2fc2b7e504182d985b723.socket --brick-port 49157 --xlator-option vol3-server.listen-port=49157
root      3587  4733  0 12:41 pts/0    00:00:00 grep snapd
[root@rhs-arch-srv4 ~]#

Start glusterd on the machines where it was stopped and look for the snapd process: it is still running.

The same case was also run with a different scenario, stopping the volume while glusterd was down. In that case, when glusterd comes back online, the stale brick process does get killed.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.6.1

How reproducible:
=================
always

Actual results:
===============
snapd process is online even though, from the user's point of view, uss is off

Expected results:
=================
snapd process should be killed

--- Additional comment from Rahul Hinduja on 2014-10-30 08:51:20 EDT ---

Additional info:
================
If uss is now re-enabled on the same volume, the ports are shown as N/A for all the servers that were brought back online:

[root@inception ~]# gluster v status vol3 | grep -i "snapshot daemon"
Snapshot Daemon on localhost                    49159   Y       2716
Snapshot Daemon on hostname1                    N/A     Y       3265
Snapshot Daemon on hostname2                    N/A     Y       3868
Snapshot Daemon on hostname3                    N/A     Y       3731
[root@inception ~]#
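The transcripts above boil down to a short sequence that can also be used to verify the fix once it lands. The following is a condensed reproduction sketch, assuming a test volume named vol3, a peer node reachable over ssh as hostname2 (both names simply mirror the transcripts), and a glusterd managed by systemd; substitute "service glusterd stop/start" on older init systems.

# Run from a node whose glusterd stays up throughout.
VOL=vol3
PEER=hostname2                           # peer node, as in the output above

gluster volume set "$VOL" uss on         # spawns snapd on every node
ssh "$PEER" systemctl stop glusterd      # take glusterd down on one peer
gluster volume set "$VOL" uss off        # snapd on $PEER survives this step
ssh "$PEER" systemctl start glusterd     # with the fix, restart should kill snapd

# Any surviving snapd here means the bug is still present on $PEER:
ssh "$PEER" "ps -eaf | grep '[s]napd'"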
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.6.2, please reopen this bug report.

glusterfs-3.6.2 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should already be available or will become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution. The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases > 3.6.2.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137