Description of problem:
'service glusterd stop' stops only the management daemon and not all the gluster processes, which is the intended behaviour (as decided in bug 1152992 and the related doc bug 1184846). A separate script, 'stop-all-gluster-processes.sh', was created to achieve the intent of stopping all gluster processes. That script fails to work when there is more than one gsync process running.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.52-1

How reproducible:
Always

Steps to Reproduce:
1. Create geo-rep session(s) between master and slave (any configuration).
2. After successfully establishing and starting the session(s), run "ps aux | grep gluster | grep gsync | awk '{print $2}'" to confirm that there is more than one gsync process (or any related process, for that matter) running.
3. Execute the script /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh; it should complete without any errors.
4. Run the 'ps' command from step 2 again to verify that the gluster processes have stopped.

Actual results:
The script fails when there is more than one process running. It errors out with 'too many arguments'.

Expected results:
The script should work irrespective of the number of processes active at that time.

Additional info:
Running the 'kill' command in a loop, rather than as a single stand-alone invocation, would fix this. When the steps from the script are run manually (and individually) on each pid, the process associated with that pid does get killed.

[root@dhcp43-154 ~]# service glusterd status
glusterd (pid 28125) is running...
[root@dhcp43-154 ~]# service glusterd stop
[ OK ]
[root@dhcp43-154 ~]# service glusterd status
glusterd is stopped
[root@dhcp43-154 ~]#
[root@dhcp43-154 ~]# ps aux | grep gsync
root 28764 0.0 0.0 103252 840 pts/0 S+ 17:52 0:00 grep gsync
root 30471 0.0 0.1 801384 13376 ? Ssl Mar17 1:57 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/brick1/d1 --path=/rhs/brick2/d1 -c /var/lib/glusterd/geo-replication/master_dhcp42-130_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=508fa5df-0fb9-4bad-9878-7e05b25d4fd9 geoacc@dhcp42-130::slave -N -p --slave-id e896eed0-bddd-448b-bd58-dc2b34962aea --local-path /rhs/brick2/d1 --agent --rpc-fd 7,11,10,9
root 30472 0.2 0.2 1497500 17716 ? Sl Mar17 10:18 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/brick1/d1 --path=/rhs/brick2/d1 -c /var/lib/glusterd/geo-replication/master_dhcp42-130_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=508fa5df-0fb9-4bad-9878-7e05b25d4fd9 geoacc@dhcp42-130::slave -N -p --slave-id e896eed0-bddd-448b-bd58-dc2b34962aea --feedback-fd 13 --local-path /rhs/brick2/d1 --local-id .%2Frhs%2Fbrick2%2Fd1 --rpc-fd 10,9,7,11 --resource-remote ssh://geoacc@dhcp42-130:gluster://localhost:slave
root 30482 0.0 0.0 62508 4344 ? S Mar17 0:02 ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-JdGxkw/f7cab8010385926c094e1701b0926f87.sock geoacc@dhcp42-130 /nonexistent/gsyncd --session-owner fcf732d1-81d6-42d1-8915-cc2107fd72f2 -N --listen --timeout 120 gluster://localhost:slave
root 30492 0.4 0.7 567480 57520 ? Ssl Mar17 17:34 /usr/sbin/glusterfs --aux-gfid-mount --log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Fgeoacc%4010.70.42.130%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Frhs%2Fbrick2%2Fd1.gluster.log --volfile-server=localhost --volfile-id=master --client-pid=-1 /tmp/gsyncd-aux-mount-bHyf9N
[root@dhcp43-154 ~]# ps aux | grep gsync | wc -l
5
[root@dhcp43-154 ~]# sh /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
sending SIGTERM to process 23366
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh: line 9: kill: (23366) - No such process
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh: line 16: test: too many arguments
sending SIGKILL to process 23366
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh: line 25: kill: (23366) - No such process
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh: line 30: test: too many arguments
[root@dhcp43-154 ~]#
[root@dhcp43-154 ~]# ps aux | grep gsync | wc -l
5
[root@dhcp43-154 ~]#
[root@dhcp43-154 ~]# ps aux | grep gsync
root 28796 0.0 0.0 103252 844 pts/0 S+ 17:52 0:00 grep gsync
root 30471 0.0 0.1 801384 15400 ? Ssl Mar17 1:57 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/brick1/d1 --path=/rhs/brick2/d1 -c /var/lib/glusterd/geo-replication/master_dhcp42-130_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=508fa5df-0fb9-4bad-9878-7e05b25d4fd9 geoacc@dhcp42-130::slave -N -p --slave-id e896eed0-bddd-448b-bd58-dc2b34962aea --local-path /rhs/brick2/d1 --agent --rpc-fd 7,11,10,9
root 30472 0.2 0.2 1497500 17716 ? Sl Mar17 10:18 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/brick1/d1 --path=/rhs/brick2/d1 -c /var/lib/glusterd/geo-replication/master_dhcp42-130_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=508fa5df-0fb9-4bad-9878-7e05b25d4fd9 geoacc@dhcp42-130::slave -N -p --slave-id e896eed0-bddd-448b-bd58-dc2b34962aea --feedback-fd 13 --local-path /rhs/brick2/d1 --local-id .%2Frhs%2Fbrick2%2Fd1 --rpc-fd 10,9,7,11 --resource-remote ssh://geoacc@dhcp42-130:gluster://localhost:slave
root 30482 0.0 0.0 62508 4344 ? S Mar17 0:02 ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-JdGxkw/f7cab8010385926c094e1701b0926f87.sock geoacc@dhcp42-130 /nonexistent/gsyncd --session-owner fcf732d1-81d6-42d1-8915-cc2107fd72f2 -N --listen --timeout 120 gluster://localhost:slave
root 30492 0.4 0.7 567480 57520 ? Ssl Mar17 17:34 /usr/sbin/glusterfs --aux-gfid-mount --log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Fgeoacc%4010.70.42.130%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Frhs%2Fbrick2%2Fd1.gluster.log --volfile-server=localhost --volfile-id=master --client-pid=-1 /tmp/gsyncd-aux-mount-bHyf9N
[root@dhcp43-154 ~]# vi /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
[root@dhcp43-154 ~]#

########### Running the steps from the script, manually, on an individual pid ###########

[root@dhcp43-154 ~]# cat /var/lib/glusterd/geo-replication/master_dhcp42-130_slave/ssh%3A%2F%2Fgeoacc%4010.70.42.130%3Agluster%3A%2F%2F127.0.0.1%3Aslave.pid
23366
[root@dhcp43-154 ~]#
[root@dhcp43-154 ~]# ps aux | grep gluster | grep gsync | awk '{print $2}'
30471
30472
30482
30492
[root@dhcp43-154 ~]#
[root@dhcp43-154 ~]# test -n 30471
[root@dhcp43-154 ~]# kill -KILL 30471
[root@dhcp43-154 ~]# ps aux | grep gluster | grep gsync | awk {'print $2'}
30472
30482
30492
[root@dhcp43-154 ~]#
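The "test: too many arguments" failure in the transcript above can be reduced to a small sketch. This is an illustrative reduction, not the actual script: the variable name `pids` and the PID values are made up, but the pattern (an unquoted expansion carrying several PIDs handed to `test -n`, which accepts only a single string) matches the reported error.

```shell
# Illustrative reduction of the bug; `pids` is a hypothetical variable name.
pids="30471 30472 30482"

# Unquoted expansion: test receives three operands after -n and errors out
# with "too many arguments", so this branch is never taken.
if test -n $pids 2>/dev/null; then
    echo "unquoted: non-empty"
else
    echo "unquoted: check errored"
fi

# Quoted expansion: a single argument, so the check behaves as intended.
if test -n "$pids"; then
    echo "quoted: non-empty"
fi
```

The same quoting mistake is harmless while only one PID is present, which is why the script appeared to work until multiple gsync processes were running.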
Severity could be changed depending on the number of customers who are using this script. Choosing whether or not to take this bug in 3.0.4 will affect the outcome of the related doc bug 1184846.
Hi Sweta, I see this being added as a known issue for the 3.0.4 release. Could you please point me to someone who can provide the info, or fill out the doc text field yourself?
Please get in touch with Aravinda or Kotresh for the updated content that has to go in the doc.
Patch Sent Upstream: http://review.gluster.org/#/c/9970/
RCA:
`test -n` takes a single string as its argument. In the script, when multiple pids are returned, it fails with "too many arguments".

Documentation changes required irrespective of whether the patch is taken in for 3.0.4 or not:

Present documentation:
----------
In section: 4.1. Starting and Stopping the glusterd service

Run the following command to stop glusterd manually.
# service glusterd stop

Note
Stopping the glusterd service does not stop all the gluster processes such as glusterfsd, gsyncd etc. Execute the script stop-all-gluster-processes.sh available in the /usr/share/glusterfs/scripts directory to stop all the gluster processes.
-------------------

Instead of directly killing the geo-rep processes through the script, we should say: if a geo-rep session is established, use the geo-rep stop command to stop geo-replication, which in turn kills all geo-rep processes and updates the status properly. Afterwards, use the stop-all-gluster-processes.sh script to kill all other gluster processes.
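The loop-based approach suggested in the bug's Additional info section could look roughly like this. This is a hedged sketch, not the actual upstream patch (that is linked in the earlier comment); the `signal_pids` helper name is invented for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a loop-based fix; signal_pids is an invented name.
# Signalling each pid individually avoids handing a whole PID list to a
# single test/kill invocation.
signal_pids() {
    sig=$1
    shift
    for pid in "$@"; do
        echo "sending SIG${sig} to process ${pid}"
        kill -"${sig}" "${pid}" 2>/dev/null || true
    done
}

# Collect gsyncd PIDs the same way the reproduction steps do;
# the [g] trick keeps grep from matching its own process.
gsync_pids=$(ps aux | grep '[g]sync' | awk '{print $2}')
signal_pids TERM $gsync_pids
```

Leaving `$gsync_pids` unquoted at the call site is deliberate here: word splitting turns the list into one argument per PID, which the loop then handles one at a time.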
Kotresh, I see this bug added in the known issues tracker bug for the 3.0.4 release. Could you please change the doc type to known issue and fill out the doc text field?
Hi Kotresh, Please review the edited doc text for technical accuracy and sign off.
The doc text looks fine.
*** Bug 1014850 has been marked as a duplicate of this bug. ***
stop-all-gluster-processes.sh is not packaged with the glusterfs-3.7.0-2 version. Moving this bug to the assigned state.

[root@georep1 ~]# ls /usr/share/glusterfs/scripts/
generate-gfid-file.sh  get-gfid.sh  gsync-sync-gfid  gsync-upgrade.sh
post-upgrade-script-for-quota.sh  pre-upgrade-script-for-quota.sh  slave-upgrade.sh
[root@georep1 ~]# rpm -qa | grep gluster | xargs rpm -ql | grep stop-all-gluster-processes.sh*
[root@georep1 ~]#
Patches:
master: http://review.gluster.org/10931
release-3.7: http://review.gluster.org/11015
downstream: https://code.engineering.redhat.com/gerrit/#/c/49676/
Verified with build: glusterfs-3.7.1-7.el6rhs.x86_64

[root@georep1 ~]# ps aux | grep gsync | awk {'print $2'}
16277
16327
16328
16344
16363
16449
16450
16463
16469
18137
18138
18159
[root@georep1 ~]# bash /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
sending SIGTERM to process 16277
sending SIGTERM to process 16563
sending SIGTERM to process 16618
sending SIGTERM to process 15420
sending SIGTERM to process 15402
sending SIGTERM to process 16610
sending SIGTERM to process 15656
sending SIGTERM to process 16092
sending SIGKILL to process 16277
[root@georep1 ~]#
[root@georep1 ~]# ps aux | grep gsync
root 18207 0.0 0.0 103308 852 pts/0 S+ 16:20 0:00 grep gsync
[root@georep1 ~]#

Moving the bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html