Description of problem:
=======================
When USS is enabled, glusterd starts snapd on all the machines in the cluster.
In a scenario where the user disables USS while glusterd is down on some node,
USS gets disabled cluster-wide, but the snapd process stays alive on the node
where glusterd went down. That much is expected. The bug is that when glusterd
comes back up on that node, snapd keeps running even though USS is disabled.

For example:
============

USS is disabled and no snapd process is running on any machine:
===============================================================
[root@inception ~]# gluster v i vol3 | grep uss
features.uss: off
[root@inception ~]# ps -eaf | grep snapd
root      2299 26954  0 18:05 pts/0    00:00:00 grep snapd
[root@inception ~]#

Enable USS; the snapd process should run on all the machines:
=============================================================
[root@inception ~]# gluster v set vol3 uss on
volume set: success
[root@inception ~]# gluster v i vol3 | grep uss
features.uss: on
[root@inception ~]#
[root@inception ~]# gluster v status vol3 | grep -i "snapshot daemon"
Snapshot Daemon on localhost                            49158   Y       2322
Snapshot Daemon on rhs-arch-srv2.lab.eng.blr.redhat.com 49157   Y       3868
Snapshot Daemon on rhs-arch-srv3.lab.eng.blr.redhat.com 49157   Y       3731
Snapshot Daemon on rhs-arch-srv4.lab.eng.blr.redhat.com 49157   Y       3265
[root@inception ~]#

Now disable USS and, at the same time, stop glusterd on multiple machines:
==========================================================================
[root@inception ~]# gluster v set vol3 uss off
volume set: success
[root@inception ~]# gluster v status vol3 | grep -i "snapshot daemon"
[root@inception ~]# gluster v status vol3
Status of volume: vol3
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick inception.lab.eng.blr.redhat.com:/rhs/brick4/b4   49155   Y       32406
NFS Server on localhost                                 2049    Y       2431
Self-heal Daemon on localhost                           N/A     Y       2202

Task Status of Volume vol3
------------------------------------------------------------------------------
There are no active volume tasks

[root@inception ~]#

snapd should not be running on the machine where glusterd is UP, but should
still be running on the machines where glusterd is down:
========================================================

Node1:
======
[root@inception ~]# ps -eaf | grep snapd
root      2501 26954  0 18:11 pts/0    00:00:00 grep snapd
[root@inception ~]#

Node2:
======
[root@rhs-arch-srv2 ~]# ps -eaf | grep snapd
root      3868     1  0 12:36 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/vol3 -p /var/lib/glusterd/vols/vol3/run/vol3-snapd.pid -l /var/log/glusterfs/vol3-snapd.log --brick-name snapd-vol3 -S /var/run/c01a04ffff6172926bfc0364bd457af3.socket --brick-port 49157 --xlator-option vol3-server.listen-port=49157
root      4163  5023  0 12:41 pts/0    00:00:00 grep snapd
[root@rhs-arch-srv2 ~]#

Node3:
======
[root@rhs-arch-srv3 ~]# ps -eaf | grep snapd
root      3731     1  0 12:35 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/vol3 -p /var/lib/glusterd/vols/vol3/run/vol3-snapd.pid -l /var/log/glusterfs/vol3-snapd.log --brick-name snapd-vol3 -S /var/run/79af174d6c9c86897e0ff72f002994f2.socket --brick-port 49157 --xlator-option vol3-server.listen-port=49157
root      4028  5029  0 12:40 pts/0    00:00:00 grep snapd
[root@rhs-arch-srv3 ~]#

Node4:
======
[root@rhs-arch-srv4 ~]# ps -eaf | grep snapd
root      3265     1  0 12:36 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/vol3 -p /var/lib/glusterd/vols/vol3/run/vol3-snapd.pid -l /var/log/glusterfs/vol3-snapd.log --brick-name snapd-vol3 -S /var/run/4bd0ff786ad2fc2b7e504182d985b723.socket --brick-port 49157 --xlator-option vol3-server.listen-port=49157
root      3587  4733  0 12:41 pts/0    00:00:00 grep snapd
[root@rhs-arch-srv4 ~]#

Start glusterd on the machines where it was stopped and look for the snapd
process: it is still running.
Ran the same case with a different scenario: bring down the volume and, at the
same time, bring down glusterd. In that case, when glusterd comes back online,
the brick process does get killed.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.6.0.30-1.el6rhs.x86_64

How reproducible:
=================
Always

Actual results:
===============
The snapd process stays online even though, from the user's point of view,
USS is off.

Expected results:
=================
The snapd process should be killed.
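The two states being compared above can be checked mechanically. A minimal
sketch of pulling the port, online flag, and PID out of a "Snapshot Daemon"
row (the sample row is copied from the transcript above; real `gluster v
status` output is assumed to keep the same whitespace-separated column
layout):

```shell
# Sample "Snapshot Daemon" row as printed by `gluster v status vol3`
# (copied from this report).
line='Snapshot Daemon on localhost                            49158   Y       2322'

# Columns from the right: Pid, Online flag, Port.
pid=$(echo "$line" | awk '{print $NF}')
online=$(echo "$line" | awk '{print $(NF-1)}')
port=$(echo "$line" | awk '{print $(NF-2)}')
echo "snapd port=$port online=$online pid=$pid"
```

Comparing the PID from this row against `ps -eaf | grep snapd` on each node is
exactly the check performed manually throughout this report.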
Additional info:
================
Let's say you now enable USS on the same volume again: the ports are then
shown as N/A for all the servers that were brought back online.

[root@inception ~]# gluster v status vol3 | grep -i "snapshot daemon"
Snapshot Daemon on localhost                            49159   Y       2716
Snapshot Daemon on rhs-arch-srv4.lab.eng.blr.redhat.com N/A     Y       3265
Snapshot Daemon on rhs-arch-srv2.lab.eng.blr.redhat.com N/A     Y       3868
Snapshot Daemon on rhs-arch-srv3.lab.eng.blr.redhat.com N/A     Y       3731
[root@inception ~]#
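What the fix needs to do when glusterd comes back up can be sketched as
follows. This is an illustration only, not the actual glusterd code: the
setting string and the PIDs are the leftover values from this report, and the
`kill` is left commented out so the sketch only reports what it would do:

```shell
# USS setting as printed by `gluster v i vol3 | grep uss` (from this report).
uss_setting='features.uss: off'
# Leftover snapd PIDs observed on the restarted nodes (from this report).
stale_pids='3868 3731 3265'

# Strip everything up to ": " to get the option value.
uss_value=${uss_setting##*: }

# If USS is off, any surviving snapd process is stale and should be reaped.
if [ "$uss_value" = "off" ]; then
  for p in $stale_pids; do
    echo "stale snapd pid $p (would kill)"
    # kill "$p"   # destructive; left commented out in this sketch
  done
fi
```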
This issue is resolved. The patch that fixes it has been reviewed upstream;
we are waiting for the regression tests to pass so that it can be merged
upstream. After that I'll send the corresponding patch downstream.
https://code.engineering.redhat.com/gerrit/#/c/36772/ fixes the issue
Verified the bug with the following gluster version and did not find the
issue. Marking the bug as VERIFIED.

[root@dhcp42-244 yum.repos.d]# rpm -qa | grep glusterfs
samba-glusterfs-3.6.509-169.1.el6rhs.x86_64
glusterfs-3.6.0.33-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.33-1.el6rhs.x86_64
glusterfs-cli-3.6.0.33-1.el6rhs.x86_64
glusterfs-libs-3.6.0.33-1.el6rhs.x86_64
glusterfs-api-3.6.0.33-1.el6rhs.x86_64
glusterfs-server-3.6.0.33-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.33-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.33-1.el6rhs.x86_64
[root@dhcp42-244 yum.repos.d]# service glusterd start
Starting glusterd:                                         [  OK  ]
[root@dhcp42-244 yum.repos.d]# gluster peer status
Number of Peers: 3

Hostname: 10.70.43.6
Uuid: 2c0d5fe8-a014-4978-ace7-c663e4cc8d91
State: Peer in Cluster (Connected)

Hostname: 10.70.42.204
Uuid: 2a2a1b36-37e3-4336-b82a-b09dcc2f745e
State: Peer in Cluster (Connected)

Hostname: 10.70.42.10
Uuid: 77c49bfc-6cb4-44f3-be12-41447a3a452e
State: Peer in Cluster (Connected)
[root@dhcp42-244 yum.repos.d]#
[root@dhcp42-244 yum.repos.d]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60c63773-39e8-4145-9985-5bcedf59cd1b
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick1/testvol
Brick2: 10.70.43.6:/rhs/brick2/testvol
Brick3: 10.70.42.204:/rhs/brick3/testvol
Brick4: 10.70.42.10:/rhs/brick4/testvol
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

Volume Name: testvol1
Type: Distributed-Replicate
Volume ID: bcd90c32-e79d-4197-a5b2-b0ea1d52002d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.244:/rhs/brick2/testvol
Brick2: 10.70.43.6:/rhs/brick3/testvol
Brick3: 10.70.42.204:/rhs/brick4/testvol
Brick4: 10.70.42.10:/rhs/brick1/testvol
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

[root@dhcp42-244 yum.repos.d]# gluster volume status
Status of volume: testvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.244:/rhs/brick1/testvol                  49152   Y       28796
Brick 10.70.43.6:/rhs/brick2/testvol                    49152   Y       28582
Brick 10.70.42.204:/rhs/brick3/testvol                  49152   Y       28859
Brick 10.70.42.10:/rhs/brick4/testvol                   49152   Y       25645
NFS Server on localhost                                 2049    Y       28810
Self-heal Daemon on localhost                           N/A     Y       28815
NFS Server on 10.70.43.6                                2049    Y       28596
Self-heal Daemon on 10.70.43.6                          N/A     Y       28601
NFS Server on 10.70.42.10                               2049    Y       25660
Self-heal Daemon on 10.70.42.10                         N/A     Y       25665
NFS Server on 10.70.42.204                              2049    Y       28873
Self-heal Daemon on 10.70.42.204                        N/A     Y       28878

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: testvol1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.244:/rhs/brick2/testvol                  49153   Y       28801
Brick 10.70.43.6:/rhs/brick3/testvol                    49153   Y       28589
Brick 10.70.42.204:/rhs/brick4/testvol                  49153   Y       28866
Brick 10.70.42.10:/rhs/brick1/testvol                   49153   Y       25653
NFS Server on localhost                                 2049    Y       28810
Self-heal Daemon on localhost                           N/A     Y       28815
NFS Server on 10.70.42.10                               2049    Y       25660
Self-heal Daemon on 10.70.42.10                         N/A     Y       25665
NFS Server on 10.70.43.6                                2049    Y       28596
Self-heal Daemon on 10.70.43.6                          N/A     Y       28601
NFS Server on 10.70.42.204                              2049    Y       28873
Self-heal Daemon on 10.70.42.204                        N/A     Y       28878

Task Status of Volume testvol1
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp42-244 yum.repos.d]#
[root@dhcp42-244 yum.repos.d]# ps -aef | grep glusterfs*
root     28796     1  0 00:29 ?        00:00:00 /usr/sbin/glusterfsd -s 10.70.42.244 --volfile-id testvol.10.70.42.244.rhs-brick1-testvol -p /var/lib/glusterd/vols/testvol/run/10.70.42.244-rhs-brick1-testvol.pid -S /var/run/5d2ea4e94d53cee919733c03d99598b3.socket --brick-name /rhs/brick1/testvol -l /var/log/glusterfs/bricks/rhs-brick1-testvol.log --xlator-option *-posix.glusterd-uuid=1ed937c4-aaba-4c64-abd8-556f37a63030 --brick-port 49152 --xlator-option testvol-server.listen-port=49152
root     28801     1  0 00:29 ?        00:00:00 /usr/sbin/glusterfsd -s 10.70.42.244 --volfile-id testvol1.10.70.42.244.rhs-brick2-testvol -p /var/lib/glusterd/vols/testvol1/run/10.70.42.244-rhs-brick2-testvol.pid -S /var/run/65406ee4edd7eb0b46d39e0a7738cf24.socket --brick-name /rhs/brick2/testvol -l /var/log/glusterfs/bricks/rhs-brick2-testvol.log --xlator-option *-posix.glusterd-uuid=1ed937c4-aaba-4c64-abd8-556f37a63030 --brick-port 49153 --xlator-option testvol1-server.listen-port=49153
root     28810     1  0 00:29 ?        00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/c7d7a0963dade75bd42ba7eef07e657f.socket
root     28815     1  0 00:29 ?        00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/113cd33d135531e963db306d2e62da0f.socket --xlator-option *replicate*.node-uuid=1ed937c4-aaba-4c64-abd8-556f37a63030
root     28902 28035  0 00:32 pts/0    00:00:00 grep glusterfs*
[root@dhcp42-244 yum.repos.d]# gluster snapshot list
No snapshots present
[root@dhcp42-244 yum.repos.d]#

[root@dhcp42-244 ~]# gluster snapshot list
snap1
snap2
[root@dhcp42-244 ~]#
[root@dhcp42-244 ~]# ps -aef | grep snapd
root     29633 28035  0 00:40 pts/0    00:00:00 grep snapd
[root@dhcp42-244 ~]#
[root@dhcp42-244 ~]# ps -ef | grep snapd
root     29660     1  0 00:42 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/testvol -p /var/lib/glusterd/vols/testvol/run/testvol-snapd.pid -l /var/log/glusterfs/snaps/testvol/snapd.log --brick-name snapd-testvol -S /var/run/2b39eca5a85774c651a3ae045f4834e9.socket --brick-port 49154 --xlator-option testvol-server.listen-port=49154
root     29737 28035  0 00:43 pts/0    00:00:00 grep snapd
[root@dhcp42-244 ~]#
[root@dhcp42-244 ~]# gluster volume info testvol | grep uss
features.uss: on
[root@dhcp42-244 ~]#
[root@dhcp42-244 ~]# gluster v status testvol | grep -i "snapshot daemon"
Snapshot Daemon on localhost                            49154   Y       29660
Snapshot Daemon on 10.70.42.204                         49154   Y       29556
Snapshot Daemon on 10.70.42.10                          49154   Y       26344
Snapshot Daemon on 10.70.43.6                           49154   Y       29288
[root@dhcp42-244 ~]#
[root@dhcp42-244 ~]# gluster volume set testvol features.uss off
volume set: success
[root@dhcp42-244 ~]# gluster volume info testvol | grep uss
features.uss: off
[root@dhcp42-244 ~]# gluster v status testvol | grep -i "snapshot daemon"
[root@dhcp42-244 ~]#
[root@dhcp42-244 ~]# gluster v status testvol | grep -i "snapshot daemon"
Snapshot Daemon on localhost                            49155   Y       29925
Snapshot Daemon on 10.70.43.6                           49155   Y       29497
Snapshot Daemon on 10.70.42.10                          49155   Y       26539
Snapshot Daemon on 10.70.42.204                         49155   Y       29746

[root@dhcp43-6 yum.repos.d]# service glusterd stop
[root@dhcp43-6 yum.repos.d]#                               [  OK  ]

Verify that snapd is not running on the node where glusterd is down:
====================================================================
[root@dhcp42-244 ~]# gluster v status testvol | grep -i "snapshot daemon"
Snapshot Daemon on localhost                            49155   Y       29925
Snapshot Daemon on 10.70.42.204                         49155   Y       29746
Snapshot Daemon on 10.70.42.10                          49155   Y       26539
[root@dhcp42-244 ~]#
[root@dhcp42-244 ~]# gluster volume set testvol features.uss off
volume set: success
[root@dhcp42-244 ~]# gluster v status testvol | grep -i "snapshot daemon"
[root@dhcp42-244 ~]# gluster volume set testvol features.uss on
volume set: success
[root@dhcp42-244 ~]# gluster v status testvol | grep -i "snapshot daemon"
Snapshot Daemon on localhost                            49156   Y       30116
Snapshot Daemon on 10.70.42.204                         49156   Y       29886
Snapshot Daemon on 10.70.42.10                          49156   Y       26679
[root@dhcp42-244 ~]#

Restart glusterd on the node where it was stopped and verify that snapd is
running on that node:
=====================
[root@dhcp43-6 yum.repos.d]# service glusterd start
Starting glusterd:                                         [  OK  ]
[root@dhcp43-6 yum.repos.d]#

[root@dhcp42-244 ~]# gluster v status testvol | grep -i "snapshot daemon"
Snapshot Daemon on localhost                            49156   Y       30116
Snapshot Daemon on 10.70.42.10                          49156   Y       26679
Snapshot Daemon on 10.70.42.204                         49156   Y       29886
Snapshot Daemon on 10.70.43.6                           49156   Y       29798
[root@dhcp42-244 ~]#
[root@dhcp42-244 ~]# gluster volume set testvol features.uss off
volume set: success
[root@dhcp42-244 ~]# gluster v status testvol | grep -i "snapshot daemon"
[root@dhcp42-244 ~]#
[root@dhcp42-244 ~]# gluster snapshot status

Snap Name : snap1
Snap UUID : 78cc0645-31f7-4b9c-8d4b-c0565247f84e

    Brick Path        : 10.70.42.244:/var/run/gluster/snaps/623c4bb66e584122830e27bb9e512519/brick1/testvol
    Volume Group      : RHS_vg1
    Brick Running     : No
    Brick PID         : N/A
    Data Percentage   : 0.20
    LV Size           : 13.47g

    Brick Path        : 10.70.43.6:/var/run/gluster/snaps/623c4bb66e584122830e27bb9e512519/brick2/testvol
    Volume Group      : RHS_vg2
    Brick Running     : No
    Brick PID         : N/A
    Data Percentage   : 0.20
    LV Size           : 13.47g

    Brick Path        : 10.70.42.204:/var/run/gluster/snaps/623c4bb66e584122830e27bb9e512519/brick3/testvol
    Volume Group      : RHS_vg3
    Brick Running     : No
    Brick PID         : N/A
    Data Percentage   : 0.20
    LV Size           : 13.47g

    Brick Path        : 10.70.42.10:/var/run/gluster/snaps/623c4bb66e584122830e27bb9e512519/brick4/testvol
    Volume Group      : RHS_vg4
    Brick Running     : No
    Brick PID         : N/A
    Data Percentage   : 0.20
    LV Size           : 13.47g

Snap Name : snap2
Snap UUID : 3febe842-d07c-4b54-8e5c-d17c60c8e845

    Brick Path        : 10.70.42.244:/var/run/gluster/snaps/95b63de2c6af4c0d995b0012ffc5b60e/brick1/testvol
    Volume Group      : RHS_vg1
    Brick Running     : No
    Brick PID         : N/A
    Data Percentage   : 0.20
    LV Size           : 13.47g

    Brick Path        : 10.70.43.6:/var/run/gluster/snaps/95b63de2c6af4c0d995b0012ffc5b60e/brick2/testvol
    Volume Group      : RHS_vg2
    Brick Running     : No
    Brick PID         : N/A
    Data Percentage   : 0.20
    LV Size           : 13.47g

    Brick Path        : 10.70.42.204:/var/run/gluster/snaps/95b63de2c6af4c0d995b0012ffc5b60e/brick3/testvol
    Volume Group      : RHS_vg3
    Brick Running     : No
    Brick PID         : N/A
    Data Percentage   : 0.20
    LV Size           : 13.47g

    Brick Path        : 10.70.42.10:/var/run/gluster/snaps/95b63de2c6af4c0d995b0012ffc5b60e/brick4/testvol
    Volume Group      : RHS_vg4
    Brick Running     : No
    Brick PID         : N/A
    Data Percentage   : 0.20
    LV Size           : 13.47g

[root@dhcp42-244 ~]# gluster snapshot activate snap1
Snapshot activate: snap1: Snap activated successfully
[root@dhcp42-244 ~]# gluster snapshot activate snap2
Snapshot activate: snap2: Snap activated successfully
[root@dhcp42-244 ~]#

[root@dhcp43-190 .snaps]# ls -lrt
total 0
d---------. 0 root root 0 Dec 31  1969 snap2
d---------. 0 root root 0 Dec 31  1969 snap1
[root@dhcp43-190 .snaps]#

[root@dhcp42-244 ~]# gluster volume stop testvol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: testvol: success
[root@dhcp42-244 ~]# gluster volume status
Volume testvol is not started

Status of volume: testvol1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.244:/rhs/brick2/testvol                  49153   Y       28801
Brick 10.70.43.6:/rhs/brick3/testvol                    49153   Y       28589
Brick 10.70.42.204:/rhs/brick4/testvol                  49153   Y       28866
Brick 10.70.42.10:/rhs/brick1/testvol                   49153   Y       25653
NFS Server on localhost                                 2049    Y       31452
Self-heal Daemon on localhost                           N/A     Y       31459
NFS Server on 10.70.42.204                              2049    Y       31083
Self-heal Daemon on 10.70.42.204                        N/A     Y       31090
NFS Server on 10.70.42.10                               2049    Y       27862
Self-heal Daemon on 10.70.42.10                         N/A     Y       27877
NFS Server on 10.70.43.6                                2049    Y       31003
Self-heal Daemon on 10.70.43.6                          N/A     Y       31010

Task Status of Volume testvol1
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp42-244 ~]# gluster volume start testvol
volume start: testvol: success
[root@dhcp42-244 ~]# gluster volume status
Status of volume: testvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.244:/rhs/brick1/testvol                  49152   Y       31493
Brick 10.70.43.6:/rhs/brick2/testvol                    49152   Y       31033
Brick 10.70.42.204:/rhs/brick3/testvol                  49152   Y       31107
Brick 10.70.42.10:/rhs/brick4/testvol                   49152   Y       27893
Snapshot Daemon on localhost                            49157   Y       31505
NFS Server on localhost                                 2049    Y       31512
Self-heal Daemon on localhost                           N/A     Y       31523
Snapshot Daemon on 10.70.43.6                           49157   Y       31045
NFS Server on 10.70.43.6                                2049    Y       31052
Self-heal Daemon on 10.70.43.6                          N/A     N       N/A
Snapshot Daemon on 10.70.42.10                          49157   Y       27905
NFS Server on 10.70.42.10                               N/A     N       N/A
Self-heal Daemon on 10.70.42.10                         N/A     N       N/A
Snapshot Daemon on 10.70.42.204                         49157   Y       31119
NFS Server on 10.70.42.204                              N/A     N       N/A
Self-heal Daemon on 10.70.42.204                        N/A     N       N/A

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: testvol1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.244:/rhs/brick2/testvol                  49153   Y       28801
Brick 10.70.43.6:/rhs/brick3/testvol                    49153   Y       28589
Brick 10.70.42.204:/rhs/brick4/testvol                  49153   Y       28866
Brick 10.70.42.10:/rhs/brick1/testvol                   49153   Y       25653
NFS Server on localhost                                 2049    Y       31512
Self-heal Daemon on localhost                           N/A     Y       31523
NFS Server on 10.70.43.6                                2049    Y       31052
Self-heal Daemon on 10.70.43.6                          N/A     N       N/A
NFS Server on 10.70.42.10                               N/A     N       N/A
Self-heal Daemon on 10.70.42.10                         N/A     N       N/A
NFS Server on 10.70.42.204                              N/A     N       N/A
Self-heal Daemon on 10.70.42.204                        N/A     N       N/A

Task Status of Volume testvol1
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp42-244 ~]#
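The pass criterion applied throughout the verification above can be condensed
into one small check: with features.uss off, `gluster v status` must not list
any "Snapshot Daemon" row. A sketch against a hardcoded sample excerpt (on a
live cluster the status text would come from the CLI instead):

```shell
# Sample `gluster v status` excerpt with USS off (no snapd rows expected);
# on a live cluster this would be: status=$(gluster v status testvol)
status='Brick 10.70.42.244:/rhs/brick1/testvol                  49152   Y       28796
NFS Server on localhost                                 2049    Y       28810
Self-heal Daemon on localhost                           N/A     Y       28815'

uss=off
# grep -c exits non-zero on zero matches, so absorb that with `|| true`.
snapd_rows=$(printf '%s\n' "$status" | grep -c 'Snapshot Daemon' || true)

if [ "$uss" = "off" ] && [ "$snapd_rows" -eq 0 ]; then
  echo "PASS: uss off and no Snapshot Daemon listed"
else
  echo "FAIL: uss=$uss but $snapd_rows Snapshot Daemon row(s)"
fi
```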
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html