Description of problem:
Brick pid files leave a stale pid, and the brick fails to start when glusterd is started. The pid files are stored under /var/lib/glusterd, which persists across reboots. When glusterd is started (or restarted, or the host is rebooted) and the pid in the brick pid file matches any running process, the brick fails to start.

Version-Release number of selected component (if applicable):
3.10.4 from ppa:gluster/glusterfs-3.10

How reproducible:
1 to 1

Steps to Reproduce:
1. Create a volume.
2. Enable the self-heal daemon.
3. Check the pid files:
   ==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
   1398
   ==> /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid <==
   1407
4. killall -w glusterfsd
5. sleep infinity & pid=$!
6. echo $pid >/var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid
7. service glusterfs-server restart
   glusterfs-server stop/waiting
   glusterfs-server start/running, process 1548
8. gluster v status
   Status of volume: vol0
   Gluster process                             TCP Port  RDMA Port  Online  Pid
   ------------------------------------------------------------------------------
   Brick 172.28.128.5:/data/brick0             N/A       N/A        N       N/A
   Brick 172.28.128.6:/data/brick0             49152     0          Y       11023
   Self-heal Daemon on localhost               N/A       N/A        Y       1684
   Self-heal Daemon on 172.28.128.6            N/A       N/A        Y       11044

   Task Status of Volume vol0
   ------------------------------------------------------------------------------
   There are no active volume tasks

Workaround:
9. rm /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid
10. service glusterfs-server restart
    glusterfs-server stop/waiting
    glusterfs-server start/running, process 1743
11. gluster v status
    Status of volume: vol0
    Gluster process                             TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick 172.28.128.5:/data/brick0             49152     0          Y       1888
    Brick 172.28.128.6:/data/brick0             49152     0          Y       11023
    Self-heal Daemon on localhost               N/A       N/A        Y       1879
    Self-heal Daemon on 172.28.128.6            N/A       N/A        Y       11044

    Task Status of Volume vol0
    ------------------------------------------------------------------------------
    There are no active volume tasks

Actual results:
1. Brick pid file(s) remain after the brick is stopped.
2. glusterd fails to start the brick when the pid in the pid file matches any running process.

Expected results:
1. Brick pid file(s) should be cleaned up when the brick is stopped gracefully.
2. glusterd should start the brick when the process named in the pid file is not a glusterfsd process.

Additional info:
OS is Ubuntu Trusty.

Workaround: in our automation, when we stop all gluster processes (reboot, upgrade, etc.) we ensure all processes are stopped and then clean up the pid files with:
find /var/lib/glusterd/ -name '*pid' -delete
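For reference, a minimal sketch of that cleanup step as it might look in automation; it only combines the commands already shown above, and the exact service name and pid-file locations may differ per distribution:

    # Sketch only: stop gluster, make sure the brick/shd processes are really gone,
    # then drop any stale pid files before starting glusterd again.
    service glusterfs-server stop
    killall -w glusterfsd glusterfs 2>/dev/null || true   # -w waits for the processes to exit
    find /var/lib/glusterd/ -name '*pid' -delete          # remove stale pid files
    service glusterfs-server start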
Looks like there may be a fix for this already:
https://review.gluster.org/#/c/13580/
https://review.gluster.org/#/c/17601
May also lead to situations like this:

$ gluster vol heal $vol statistics
Gathering crawl statistics on volume $vol has been unsuccessful on bricks that are down. Please check if all brick processes are running.

or

$ gluster v heal testvol statistics
Gathering crawl statistics on volume testvol has been unsuccessful: Staging failed on vm1. Error: Self-heal daemon is not running. Check self-heal daemon log file.
Also occurs with 3.10.5 from ppa:gluster/glusterfs-3.10
Upgrading priority to urgent, as this affects the stability of Gluster in general.
commit 220d406ad13d840e950eef001a2b36f87570058d
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Wed Mar 2 17:42:07 2016 +0530

    glusterd: Gluster should keep PID file in correct location

    Currently Gluster keeps process pid information of all the daemons
    and brick processes in Gluster configuration file directory
    (ie., /var/lib/glusterd/*). These pid files should be seperate from
    configuration files. Deletion of the configuration file directory
    might result into serious problems. Also, /var/run/gluster is the
    default placeholder directory for pid files.

    So, with this fix Gluster will keep all process pid information of
    all processes in /var/run/gluster/* directory.

    Change-Id: Idb09e3fccb6a7355fbac1df31082637c8d7ab5b4
    BUG: 1258561
    Signed-off-by: Gaurav Kumar Garg <ggarg>
    Signed-off-by: Saravanakumar Arumugam <sarumuga>
    Reviewed-on: https://review.gluster.org/13580
    Tested-by: MOHIT AGRAWAL <moagrawa>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>

The above commit takes care of this issue. Please note this fix is available in the release-3.12 branch. Since this is a major change in the way pid files are placed, I don't have a plan to cherry-pick it into the release-3.10 branch.
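For anyone verifying the new behaviour after upgrading, a quick way to confirm where the pid files end up (the exact file names will vary with volume and host names; this is only an illustration of the directories named in the commit message):

    # Before the fix: pid files live under the configuration directory
    find /var/lib/glusterd/ -name '*.pid'
    # After the fix (release-3.12 and the backport): pid files live under /var/run/gluster
    find /var/run/gluster/ -name '*.pid'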
Ben - do you mind if I close this issue now? As I mentioned in the earlier comment, a stable release branch may not accept this change in behaviour. If you're fine with the workaround, you can stick to the release-3.10 branch; otherwise, please upgrade to release-3.12.
I think there should be a minimal fix for 3.10. The minimal fix in this context is:

- glusterd should start the brick when the process in the pid file is not a glusterfsd process

I will also run my tests with 3.12 and report the results. A rough sketch of that check follows below.
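To illustrate the minimal-fix idea, a sketch of the check glusterd could perform before trusting a pid file, i.e. only treating the pid as live when it actually belongs to a glusterfsd process. The pid-file path is taken from the reproduction steps above; the logic is illustrative and not the actual glusterd code:

    # Sketch only, not the real glusterd implementation:
    # read the pid from the brick pid file and check which process it points at.
    pidfile=/var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid
    pid=$(cat "$pidfile" 2>/dev/null)
    if [ -n "$pid" ] && [ "$(cat /proc/$pid/comm 2>/dev/null)" = "glusterfsd" ]; then
        echo "pid $pid is a running glusterfsd; brick is already up"
    else
        echo "pid file is stale or points at an unrelated process; safe to remove it and start the brick"
    fi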
Mohit - can you please backport https://review.gluster.org/13580 to release-3.10 branch?
REVIEW: https://review.gluster.org/18484 (glusterd: Gluster should keep PID file in correct location) posted (#1) for review on release-3.10 by MOHIT AGRAWAL (moagrawa)
COMMIT: https://review.gluster.org/18484 committed in release-3.10 by Shyamsundar Ranganathan (srangana)
------
commit 411a401f7e4f81f6a77eea1438a3a43c73e06104
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Wed Mar 2 17:42:07 2016 +0530

    glusterd: Gluster should keep PID file in correct location

    Currently Gluster keeps process pid information of all the daemons
    and brick processes in Gluster configuration file directory
    (ie., /var/lib/glusterd/*). These pid files should be seperate from
    configuration files. Deletion of the configuration file directory
    might result into serious problems. Also, /var/run/gluster is the
    default placeholder directory for pid files.

    So, with this fix Gluster will keep all process pid information of
    all processes in /var/run/gluster/* directory.

    > Change-Id: Idb09e3fccb6a7355fbac1df31082637c8d7ab5b4
    > BUG: 1258561
    > Signed-off-by: Gaurav Kumar Garg <ggarg>
    > Signed-off-by: Saravanakumar Arumugam <sarumuga>
    > Reviewed-on: https://review.gluster.org/13580
    > Tested-by: MOHIT AGRAWAL <moagrawa>
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Atin Mukherjee <amukherj>
    > (Cherry pick from commit 220d406ad13d840e950eef001a2b36f87570058d)

    BUG: 1491059
    Change-Id: Idb09e3fccb6a7355fbac1df31082637c8d7ab5b4
    Signed-off-by: Mohit Agrawal <moagrawa>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.7, please open a new bug report.

glusterfs-3.10.7 has been announced on the Gluster mailinglists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-November/000085.html
[2] https://www.gluster.org/pipermail/gluster-users/