Description of problem:
The %post server section of gluster.spec says:

    killall glusterd &> /dev/null
    glusterd --xlator-option *.upgrade=on -N

This doesn't wait for the old glusterd to actually exit, so the new one sees it cannot bind to the interface and quits, and then the original one quits, leaving no glusterd actually running.

Version-Release number of selected component (if applicable):
glusterfs-3.5.1

How reproducible:
Every time

Steps to Reproduce:
1. Run glusterd
2. Upgrade from 3.5.0 to 3.5.1
3.

Actual results:
No glusterd running anymore

Expected results:
An upgraded glusterd running

Additional info:
The post-installation script for the glusterfs-server package handles the restarting of glusterd incorrectly. This caused an outage when the glusterfs-server package was automatically updated.

After checking the logs together with Patrick, we came to the conclusion that the running glusterd should have received a signal and would be exiting. However, the script does not wait for the running glusterd to exit, and starts a new glusterd process immediately after sending the SIGTERM. If the first glusterd process has not exited yet, the new glusterd process cannot listen on port 24007 and exits. The first glusterd will eventually exit too, leaving the service unavailable.

Snippet from the .spec:

    %post server
    ...
    pidof -c -o %PPID -x glusterd &> /dev/null
    if [ $? -eq 0 ]; then
    ...
        killall glusterd &> /dev/null
        glusterd --xlator-option *.upgrade=on -N
    else
        glusterd --xlator-option *.upgrade=on -N
    fi
    ...

I am not sure what the best way is to start glusterd with these specific options once. Maybe these should get listed in /etc/sysconfig/glusterd so that the standard init-script or systemd-job handles it?
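One way to close the race described here is to wait until the old process is really gone before launching the new one. The following is only a minimal sketch in POSIX shell, not the change that was eventually committed; the helper name wait_for_pid_exit and the default retry count are hypothetical. It polls with kill -0, which sends no signal and only tests whether the PID still exists.

```shell
# Hypothetical helper (not taken from glusterfs.spec.in): poll until the
# process with PID $1 is gone, checking about once per second, at most $2
# times (default 30).
# Returns 0 once the PID no longer exists, 1 if it is still there at the end.
wait_for_pid_exit() {
    pid=$1
    tries=${2:-30}
    # kill -0 delivers no signal; it only reports whether the PID can be
    # signalled. Scriptlets run as root, so a failure means "no such process".
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$tries" -le 0 ]; then
            return 1
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 0
}
```

In a scriptlet, such a helper would be called for each PID reported by pidof after sending the SIGTERM and before running the upgrade command. Where psmisc's killall is available, its --wait option blocks until the signalled processes have exited and may be the simpler choice.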
Which is the primary concern, that the new glusterd was started too soon? That we need a cleaner solution for starting glusterd with the *.upgrade=on option? Or both?
REVIEW: http://review.gluster.org/8185 (build/glusterfs.spec.in: %post server doesn't wait for old glusterd) posted (#1) for review on master by Kaleb KEITHLEY (kkeithle)
REVIEW: http://review.gluster.org/8185 (build/glusterfs.spec.in: %post server doesn't wait for old glusterd) posted (#2) for review on master by Kaleb KEITHLEY (kkeithle)
REVIEW: http://review.gluster.org/8185 (build/glusterfs.spec.in: %post server doesn't wait for old glusterd) posted (#3) for review on master by Kaleb KEITHLEY (kkeithle)
REVIEW: http://review.gluster.org/8185 (build/glusterfs.spec.in: %post server doesn't wait for old glusterd) posted (#4) for review on master by Kaleb KEITHLEY (kkeithle)
COMMIT: http://review.gluster.org/8185 committed in master by Vijay Bellur (vbellur)
------
commit 858b570a0c62d31416f0aee8c385b3118a1fad43
Author: Kaleb S. KEITHLEY <kkeithle>
Date:   Thu Jun 26 17:14:39 2014 -0400

    build/glusterfs.spec.in: %post server doesn't wait for old glusterd

    'killall glusterd' needs to wait for the old glusterd to exit
    before starting the updated one, otherwise the new process can't
    bind to its socket ports

    Change-Id: Ib43c76f232e0ea6f7f8469fb12be7f2b907fb7c8
    BUG: 1113543
    Signed-off-by: Kaleb S. KEITHLEY <kkeithle>
    Reviewed-on: http://review.gluster.org/8185
    Reviewed-by: Niels de Vos <ndevos>
    Reviewed-by: Lalatendu Mohanty <lmohanty>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Humble Devassy Chirammal <humble.devassy>
    Reviewed-by: Vijay Bellur <vbellur>
I posted the message below to the gluster-users list and was asked to also post it here:

When updates were applied a couple of nights ago, all my Gluster nodes went down and "service glusterd status" reported it dead on all 3 nodes in my replicated setup. This seems very similar to a bug that was recently fixed (https://bugzilla.redhat.com/show_bug.cgi?id=1113543). Any ideas what's up with this?

[root@eapps-gluster01 ~]# rpm -qa |grep gluster
glusterfs-libs-3.5.2-1.el6.x86_64
glusterfs-cli-3.5.2-1.el6.x86_64
glusterfs-geo-replication-3.5.2-1.el6.x86_64
glusterfs-3.5.2-1.el6.x86_64
glusterfs-fuse-3.5.2-1.el6.x86_64
glusterfs-server-3.5.2-1.el6.x86_64
glusterfs-api-3.5.2-1.el6.x86_64
Looks like the fix is not working as expected. Hence moving the bug to assigned state.
I guess it could fail in case not all packages have been updated yet. There could be some library mismatches of some kind. Instead of doing the kill+restart in %post, it may be safer to do it in %posttrans (or whatever the name is)?
Just to add that I've had the same issue: I updated four nodes from 3.5.0 to 3.5.2 and they all suffered from the same problem. As I had already run into this problem in the past, I updated one node at a time and restarted glusterd myself after each node update.
A beta release for GlusterFS 3.6.0 has been released [1]. Please verify if the release solves this bug report for you. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/
Looks like the fix is working in 3.6.0. I performed the following steps and found that glusterd did not get killed:

1. Installed 3.6.0-0.2.beta1
2. Started glusterd
3. Updated to 3.6.0-0.3.beta2
4. glusterd was still running.

[root@dhcp159-233 yum.repos.d]# yum install glusterfs-server
xxxxxxxxxxxxxxxxxxxxxxxxxx 6/6

Installed:
  glusterfs-server.x86_64 0:3.6.0-0.2.beta1.fc20

Dependency Installed:
  glusterfs.x86_64 0:3.6.0-0.2.beta1.fc20
  glusterfs-api.x86_64 0:3.6.0-0.2.beta1.fc20
  glusterfs-cli.x86_64 0:3.6.0-0.2.beta1.fc20
  glusterfs-fuse.x86_64 0:3.6.0-0.2.beta1.fc20
  glusterfs-libs.x86_64 0:3.6.0-0.2.beta1.fc20

Complete!
[root@dhcp159-233 yum.repos.d]# systemctl start glusterd
[root@dhcp159-233 yum.repos.d]# systemctl status glusterd
glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled)
   Active: active (running) since Thu 2014-09-25 07:39:24 EDT; 6s ago
  Process: 29489 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid (code=exited, status=0/SUCCESS)
 Main PID: 29490 (glusterd)
   CGroup: /system.slice/glusterd.service
           └─29490 /usr/sbin/glusterd -p /var/run/glusterd.pid

Sep 25 07:39:24 dhcp159-233.sbu.lab.eng.bos.redhat.com systemd[1]: Started GlusterFS, a clustered file-system server.
Sep 25 07:39:26 dhcp159-233.sbu.lab.eng.bos.redhat.com python[29497]: SELinux is preventing /usr/sbin/glusterfsd from write access on the sock_file . ***** Plugin catchall (100. confidence) suggests **************************...
Hint: Some lines were ellipsized, use -l to show in full.
[root@dhcp159-233 yum.repos.d]# vi glusterfs-360beta2-fedora.repo
[root@dhcp159-233 yum.repos.d]# yum update
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Updated:
  glusterfs.x86_64 0:3.6.0-0.3.beta2.fc20
  glusterfs-api.x86_64 0:3.6.0-0.3.beta2.fc20
  glusterfs-cli.x86_64 0:3.6.0-0.3.beta2.fc20
  glusterfs-fuse.x86_64 0:3.6.0-0.3.beta2.fc20
  glusterfs-libs.x86_64 0:3.6.0-0.3.beta2.fc20
  glusterfs-server.x86_64 0:3.6.0-0.3.beta2.fc20

Complete!
[root@dhcp159-233 yum.repos.d]# systemctl status glusterd
glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled)
   Active: active (running) since Thu 2014-09-25 07:40:16 EDT; 4s ago
  Process: 29562 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid (code=exited, status=0/SUCCESS)
 Main PID: 29563 (glusterd)
   CGroup: /system.slice/glusterd.service
           └─29563 /usr/sbin/glusterd -p /var/run/glusterd.pid

Sep 25 07:40:16 dhcp159-233.sbu.lab.eng.bos.redhat.com systemd[1]: Started GlusterFS, a clustered file-system server.
Sep 25 07:40:17 dhcp159-233.sbu.lab.eng.bos.redhat.com python[29569]: SELinux is preventing /usr/sbin/glusterfsd from write access on the sock_file . ***** Plugin catchall (100. confidence) suggests **************************...
Hint: Some lines were ellipsized, use -l to show in full.
After I installed psmisc ("yum install psmisc"), I can reproduce the issue, i.e. glusterd is killed after the update.
REVIEW: http://review.gluster.org/8857 (glusterfs.spec.in: add psmisc to -server subpackage) posted (#1) for review on release-3.6 by Kaleb KEITHLEY (kkeithle)
Created attachment 941092: glusterd log
Comment on attachment 941092 (glusterd log): The log is from an update where glusterd was running beforehand; after the update, glusterd was not running any more.
Figured out what the issue is:

1. glusterd is happily running
2. yum update
3. glusterd gets killed by the update
4. "glusterd --xlator-option *.upgrade=on -N" is run to update the config
5. glusterd exits when the config update is done
6. a cond-restart or try-restart is done, but glusterd is not running, so nothing gets (re)started
7. "yum update" finishes

The obvious solution is to start glusterd again after it has updated its config.
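The sequence above suggests remembering whether glusterd was running before the update and starting it again explicitly once the config upgrade run has exited. The following is a rough sketch of that idea in POSIX shell, not the scriptlet that was committed. The function name restart_glusterd_after_upgrade is hypothetical, restarting through "service glusterd start" is an assumption (a systemd-based system would likely use systemctl instead), and killall --wait assumes the psmisc implementation of killall.

```shell
# Sketch only (hypothetical function, not from glusterfs.spec.in):
# stop a running glusterd, run the one-shot config upgrade, and start the
# daemon again only if it had been running before.
restart_glusterd_after_upgrade() {
    # Record the state before touching anything, using the same pidof call
    # as the existing %post server scriptlet.
    if pidof -c -o %PPID -x glusterd > /dev/null 2>&1; then
        was_running=yes
    else
        was_running=no
    fi

    if [ "$was_running" = yes ]; then
        # psmisc killall: --wait blocks until the signalled processes exit.
        killall --wait glusterd > /dev/null 2>&1
    fi

    # One-shot run: regenerates the configuration, then exits (step 5).
    # The pattern is quoted so the shell does not glob-expand it.
    glusterd --xlator-option '*.upgrade=on' -N

    # Nothing is running at this point, so a conditional restart would be a
    # no-op (step 6). Start the service explicitly instead.
    # Assumption: the init script is reachable through "service".
    if [ "$was_running" = yes ]; then
        service glusterd start
    fi
}
```

If glusterd was not running before the update, the sketch only performs the config upgrade and leaves the service stopped, which matches the else branch of the existing scriptlet.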
REVIEW: http://review.gluster.org/8857 (glusterfs.spec.in: add psmisc to -server subpackage) posted (#2) for review on release-3.6 by Kaleb KEITHLEY (kkeithle)
REVIEW: http://review.gluster.org/8857 (glusterfs.spec.in: add psmisc to -server subpackage) posted (#3) for review on release-3.6 by Kaleb KEITHLEY (kkeithle)
REVIEW: http://review.gluster.org/8857 (glusterfs.spec.in: add psmisc to -server subpackage) posted (#4) for review on release-3.6 by Kaleb KEITHLEY (kkeithle)
REVIEW: http://review.gluster.org/8871 (glusterfs.spec.in: add psmisc to -server subpackage) posted (#1) for review on release-3.6 by Kaleb KEITHLEY (kkeithle)
COMMIT: http://review.gluster.org/8871 committed in release-3.6 by Vijay Bellur (vbellur)
------
commit dce9e79a7a7ccd5b998ca562ee026e4cfd5519c2
Author: Kaleb S. KEITHLEY <kkeithle>
Date:   Fri Sep 26 08:44:10 2014 -0400

    glusterfs.spec.in: add psmisc to -server subpackage

    apparently some minimalist installs omit psmisc

    psmisc is needed for the killall in various %pre and %post scriptlets

    smarter logic for restarting glusterd in %post server

    Change-Id: I041f22576f25e200ff6a4e2f194e539ba7ed1d42
    BUG: 1113543
    Signed-off-by: Kaleb S. KEITHLEY <kkeithle>
    Reviewed-on: http://review.gluster.org/8871
    Reviewed-by: Niels de Vos <ndevos>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Lalatendu Mohanty <lmohanty>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users