REVIEW: http://review.gluster.org/9307 (USS : Kill snapd during glusterd restart if USS is disabled) posted (#1) for review on release-3.6 by Sachin Pandit (spandit)
REVIEW: http://review.gluster.org/9307 (USS : Kill snapd during glusterd restart if USS is disabled.) posted (#2) for review on release-3.6 by Sachin Pandit (spandit)
REVIEW: http://review.gluster.org/9307 (USS : Kill snapd during glusterd restart if USS is disabled) posted (#3) for review on release-3.6 by Sachin Pandit (spandit)
COMMIT: http://review.gluster.org/9307 committed in release-3.6 by Raghavendra Bhat (raghavendra)
------
commit 9f0589646b4932b33ac0a913b1a23d8f279faf2b
Author: Sachin Pandit <spandit>
Date:   Wed Nov 5 11:09:59 2014 +0530

    USS : Kill snapd during glusterd restart if USS is disabled

    Problem : When glusterd is down on one of the nodes and during that
    time if USS is disabled then snapd will still be running
    in the node where glusterd was down.

    Solution : during restart of glusterd check if USS is disabled,
    if so then issue a kill for snapd.

    NOTE : The test case which I wrote in my previous patchset
    is facing some spurious failures, hence I thought of removing
    that test case. I'll add the test case once the issue is resolved.

    Change-Id: I2870ebb4b257d863cdfc319e8485b19e932576e9
    BUG: 1175735
    Signed-off-by: Sachin Pandit <spandit>
    Reviewed-on: http://review.gluster.org/9062
    Reviewed-by: Rajesh Joseph <rjoseph>
    Reviewed-by: Avra Sengupta <asengupt>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Krishnan Parthasarathi <kparthas>
    Signed-off-by: Sachin Pandit <spandit>
    Reviewed-on: http://review.gluster.org/9307
    Reviewed-by: Raghavendra Bhat <raghavendra>
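The patch automates, on glusterd restart, what an administrator would otherwise have to do by hand on the affected node: check whether USS is disabled for the volume and, if so, kill the leftover snapd. Below is a minimal manual sketch of that check for unfixed builds, assuming a volume named vol3 and the snapd pidfile path visible in the ps output in the description further down; adjust both for your deployment.

#!/bin/sh
# Manual workaround sketch for an unfixed build: kill a stale snapd
# left behind by a glusterd restart while USS is disabled.
# Assumptions: volume name "vol3" and the pidfile path seen in the
# bug's ps output; neither is mandated by the fix itself.
VOL=vol3
PIDFILE=/var/lib/glusterd/vols/$VOL/run/$VOL-snapd.pid

# Only act when USS is actually disabled for the volume.
if gluster volume info "$VOL" | grep -q "features.uss: off"; then
    # If snapd is still alive, kill it via the pidfile glusterd passed to it.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        kill "$(cat "$PIDFILE")"
    fi
fi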
Description of problem:
=======================
When uss is enabled, it starts snapd on all the machines in the cluster. But in a scenario where the user disables uss while glusterd is down on some node, uss gets disabled and yet the snapd process stays alive on the machine where glusterd went down. That much is expected. The problem is that when glusterd comes back up, snapd is still alive even though uss is disabled.

For example:
============

Uss is disabled and no snapd process is running on any machine:
================================================================
[root@inception ~]# gluster v i vol3 | grep uss
features.uss: off
[root@inception ~]# ps -eaf | grep snapd
root      2299 26954  0 18:05 pts/0    00:00:00 grep snapd
[root@inception ~]#

Enable uss; the snapd process should run on all the machines:
================================================================
[root@inception ~]# gluster v set vol3 uss on
volume set: success
[root@inception ~]# gluster v i vol3 | grep uss
features.uss: on
[root@inception ~]#
[root@inception ~]# gluster v status vol3 | grep -i "snapshot daemon"
Snapshot Daemon on localhost                    49158   Y       2322
Snapshot Daemon on hostname1                    49157   Y       3868
Snapshot Daemon on hostname2                    49157   Y       3731
Snapshot Daemon on hostname3                    49157   Y       3265
[root@inception ~]#

Now disable USS and at the same time stop glusterd on multiple machines:
========================================================================
[root@inception ~]# gluster v set vol3 uss off
volume set: success
[root@inception ~]# gluster v status vol3 | grep -i "snapshot daemon"
[root@inception ~]# gluster v status vol3
Status of volume: vol3
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick hostname1:/rhs/brick4/b4                  49155   Y       32406
NFS Server on localhost                         2049    Y       2431
Self-heal Daemon on localhost                   N/A     Y       2202

Task Status of Volume vol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@inception ~]#

snapd is not running on the machine where glusterd is UP, but is still running on the machines where glusterd is down:
==========================================================================

Node1:
======
[root@inception ~]# ps -eaf | grep snapd
root      2501 26954  0 18:11 pts/0    00:00:00 grep snapd
[root@inception ~]#

Node2:
======
[root@rhs-arch-srv2 ~]# ps -eaf | grep snapd
root      3868     1  0 12:36 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/vol3 -p /var/lib/glusterd/vols/vol3/run/vol3-snapd.pid -l /var/log/glusterfs/vol3-snapd.log --brick-name snapd-vol3 -S /var/run/c01a04ffff6172926bfc0364bd457af3.socket --brick-port 49157 --xlator-option vol3-server.listen-port=49157
root      4163  5023  0 12:41 pts/0    00:00:00 grep snapd
[root@rhs-arch-srv2 ~]#

Node3:
======
[root@rhs-arch-srv3 ~]# ps -eaf | grep snapd
root      3731     1  0 12:35 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/vol3 -p /var/lib/glusterd/vols/vol3/run/vol3-snapd.pid -l /var/log/glusterfs/vol3-snapd.log --brick-name snapd-vol3 -S /var/run/79af174d6c9c86897e0ff72f002994f2.socket --brick-port 49157 --xlator-option vol3-server.listen-port=49157
root      4028  5029  0 12:40 pts/0    00:00:00 grep snapd
[root@rhs-arch-srv3 ~]#

Node4:
======
[root@rhs-arch-srv4 ~]# ps -eaf | grep snapd
root      3265     1  0 12:36 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/vol3 -p /var/lib/glusterd/vols/vol3/run/vol3-snapd.pid -l /var/log/glusterfs/vol3-snapd.log --brick-name snapd-vol3 -S /var/run/4bd0ff786ad2fc2b7e504182d985b723.socket --brick-port 49157 --xlator-option vol3-server.listen-port=49157
root      3587  4733  0 12:41 pts/0    00:00:00 grep snapd
[root@rhs-arch-srv4 ~]#

Start glusterd on the machines where it was stopped and look for the snapd process: it is still running.

The same case was also run with a different scenario, stopping the volume while glusterd was down. In that case, when glusterd comes back online, the stale brick process does get killed.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.6.1

How reproducible:
=================
always

Actual results:
===============
snapd process is online even though, from the user's point of view, uss is off

Expected results:
=================
snapd process should be killed

--- Additional comment from Rahul Hinduja on 2014-10-30 08:51:20 EDT ---

Additional info:
================
If uss is now re-enabled on the same volume, the ports are shown as N/A for all the servers that were brought back online:

[root@inception ~]# gluster v status vol3 | grep -i "snapshot daemon"
Snapshot Daemon on localhost                    49159   Y       2716
Snapshot Daemon on hostname1                    N/A     Y       3265
Snapshot Daemon on hostname2                    N/A     Y       3868
Snapshot Daemon on hostname3                    N/A     Y       3731
[root@inception ~]#
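The transcripts above boil down to a short sequence that can also be used to verify the fix once it lands. The following is a condensed reproduction sketch, assuming a test volume named vol3, a peer node reachable over ssh as hostname2 (both names simply mirror the transcripts), and a glusterd managed by systemd; substitute "service glusterd stop/start" on older init systems.

# Run from a node whose glusterd stays up throughout.
VOL=vol3
PEER=hostname2                           # peer node, as in the output above

gluster volume set "$VOL" uss on         # spawns snapd on every node
ssh "$PEER" systemctl stop glusterd      # take glusterd down on one peer
gluster volume set "$VOL" uss off        # snapd on $PEER survives this step
ssh "$PEER" systemctl start glusterd     # with the fix, restart should kill snapd

# Any surviving snapd here means the bug is still present on $PEER:
ssh "$PEER" "ps -eaf | grep '[s]napd'"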
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.6.2, please reopen this bug report.

glusterfs-3.6.2 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should already be available or will become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution. The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases > 3.6.2.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137