Bug 1491060

Summary: PID file handling: self-heal-daemon pid file leaves a stale pid, and glusterd indiscriminately kills that pid when started
Product: [Community] GlusterFS Reporter: Ben Werthmann <ben>
Component: glusterd    Assignee: bugs <bugs>
Status: CLOSED EOL QA Contact:
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.10    CC: amukherj, ben, bugs, joe, moagrawa, peljasz
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-20 18:30:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1258561, 1464072    
Bug Blocks:    

Description Ben Werthmann 2017-09-12 23:32:01 UTC
Description of problem:

The self-heal-daemon pid file leaves a stale pid behind, and glusterd indiscriminately kills that pid when it starts. Pid files are stored in /var/lib/glusterd, which persists across reboots, so a recorded pid can later be recycled by an unrelated process. When glusterd is started (or restarted, or the host is rebooted), any process whose pid matches the one in the shd pid file is killed.
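The failure mode can be sketched without gluster at all; this is a hedged stand-in, not gluster code, and the pid file path is a temp file standing in for glustershd.pid:

```shell
#!/bin/sh
# Sketch of the unsafe pattern described above: read a pid file and
# signal whatever pid it contains, with no check that the process is
# still (or ever was) glusterfs. All names here are stand-ins.
pidfile=$(mktemp)                  # stand-in for glustershd.pid
sleep 300 & bystander=$!           # unrelated process holding the recorded pid
echo "$bystander" > "$pidfile"

kill -15 "$(cat "$pidfile")"       # what glusterd effectively does on start
wait "$bystander" 2>/dev/null
rc=$?                              # >128 means the bystander died by signal
echo "bystander exit status: $rc"
rm -f "$pidfile"
```

The point of the sketch is that nothing in the kill path validates what the pid currently refers to, which is exactly the behavior the reproduction steps below demonstrate.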

Version-Release number of selected component (if applicable):

3.10.4 from ppa:gluster/glusterfs-3.10

How reproducible:

Always (1 of 1 attempts)

Steps to Reproduce:
1. Create a volume. 
2. Enable the self-heal daemon.
3. Check pid file contents:
find /var/lib/glusterd/ -name '*pid' -exec tail -v {} \;
==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
11642
==> /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid <==
11169
4. killall -w glusterfs
5. Create a process, background it, and record its pid:
sleep infinity & pid=$!
[1] 11669
6. Put the pid of that process into the shd pid file:
echo $pid > /var/lib/glusterd/glustershd/run/glustershd.pid
7. Confirm the pid file now holds that pid:
find /var/lib/glusterd/ -name '*pid' -exec tail -v {} \;
==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
11669
==> /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid <==
11169
8. restart glusterfs-server
service glusterfs-server restart
glusterfs-server stop/waiting
glusterfs-server start/running, process 11687
9. shell notifies that the background process was terminated
[1]+  Terminated              sleep infinity
10. shd starts, but kills a process other than glusterfs
gluster v status
Status of volume: vol0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 172.28.128.5:/data/brick0             49152     0          Y       11169
Brick 172.28.128.6:/data/brick0             49152     0          Y       11023
Self-heal Daemon on localhost               N/A       N/A        Y       12023
Self-heal Daemon on 172.28.128.6            N/A       N/A        Y       11044
 
Task Status of Volume vol0
------------------------------------------------------------------------------
There are no active volume tasks


Note: In some cases shd fails to start. 

Note2: In one case I saw the same pid listed for the brick and shd. In this case the brick was terminated when shd started.
find /var/lib/glusterd/ -name '*pid' -exec tail -v {} \;
==> /var/lib/glusterd/vols/apcfs-default/run/172.27.0.19-data-brick0.pid <==
1468
==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
1468


Actual results:
1. The pid file /var/lib/glusterd/glustershd/run/glustershd.pid remains after shd is stopped.
2. glusterd kills whatever process currently holds the pid number recorded in the stale pid file.

Expected results:
1. The shd pid file should be cleaned up when shd stops.
2. glusterd should only kill the recorded pid if it belongs to a glusterfs process.
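A minimal sketch of the expected behavior, assuming a Linux /proc filesystem; the helper name is hypothetical and this is not gluster's actual code path:

```shell
#!/bin/sh
# Hypothetical helper: only signal the recorded pid if the process is
# actually a glusterfs instance, judged by /proc/<pid>/comm (Linux).
safe_kill_glusterfs() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    comm=$(cat "/proc/$pid/comm" 2>/dev/null)
    if [ "$comm" = "glusterfs" ]; then
        kill -15 "$pid"
    else
        # Stale or recycled pid: refuse rather than kill a bystander.
        echo "refusing to kill pid $pid (comm: ${comm:-gone})" >&2
        return 1
    fi
}
```

With the sleep-pid scenario from the reproduction steps, a check like this would refuse to signal the process, because its comm would be "sleep" rather than "glusterfs".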


Additional info:
OS is Ubuntu Trusty

Workaround:

In our automation, whenever we stop all gluster processes (reboot, upgrade, etc.), we first ensure every process has actually stopped, then clean up the pid files with 'find /var/lib/glusterd/ -name '*pid' -delete'.
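As a sketch, that workaround can be wrapped in a small function; the function name is made up, and the state directory is parameterised here so it can be exercised outside a gluster host (a real host would stop the services first and pass /var/lib/glusterd):

```shell
#!/bin/sh
# Hypothetical wrapper around the workaround above. On a real host,
# stop glusterfs-server and wait for all gluster processes to exit
# (e.g. service glusterfs-server stop && killall -w glusterfs)
# before calling this, then delete every leftover pid file.
clean_stale_pidfiles() {
    statedir=$1
    [ -d "$statedir" ] || return 1
    find "$statedir" -name '*pid' -delete
}
```

Running this only after every gluster process has exited matters: deleting a pid file for a live daemon would leave glusterd unable to track it.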

Comment 1 Ben Werthmann 2017-09-12 23:42:47 UTC
Not sure if these patches will help; it looks like there may be a fix for this already:
https://review.gluster.org/#/c/13580/
https://review.gluster.org/#/c/17601

Specifically for the "glusterd kills any process number in the stale pid file" behavior.

Comment 2 Ben Werthmann 2017-09-13 19:57:44 UTC
May also lead to situations like this:

$ gluster vol heal $vol statistics
Gathering crawl statistics on volume $vol has been unsuccessful on bricks that are down. Please check if all brick processes are running.

or

gluster v heal testvol statistics
Gathering crawl statistics on volume testvol has been unsuccessful:
  Staging failed on vm1. Error: Self-heal daemon is not running. Check 
self-heal daemon log file./

Comment 3 Ben Werthmann 2017-09-13 20:26:03 UTC
Also occurs with 3.10.5 from ppa:gluster/glusterfs-3.10

Comment 4 Ben Werthmann 2017-09-13 20:27:34 UTC
Upgrading to urgent as this affects stability of gluster in general.

Comment 5 Atin Mukherjee 2017-09-18 07:39:40 UTC
commit 220d406ad13d840e950eef001a2b36f87570058d
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Wed Mar 2 17:42:07 2016 +0530

    glusterd: Gluster should keep PID file in correct location
    
    Currently Gluster keeps process pid information of all the daemons
    and brick processes in Gluster configuration file directory
    (ie., /var/lib/glusterd/*).
    
    These pid files should be seperate from configuration files.
    Deletion of the configuration file directory might result into serious problems.
    Also, /var/run/gluster is the default placeholder directory for pid files.
    
    So, with this fix Gluster will keep all process pid information of all
    processes in /var/run/gluster/* directory.
    
    Change-Id: Idb09e3fccb6a7355fbac1df31082637c8d7ab5b4
    BUG: 1258561
    Signed-off-by: Gaurav Kumar Garg <ggarg>
    Signed-off-by: Saravanakumar Arumugam <sarumuga>
    Reviewed-on: https://review.gluster.org/13580
    Tested-by: MOHIT AGRAWAL <moagrawa>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>

The above commit takes care of this issue. Please note this fix is available in the release-3.12 branch. Since this is a major change in how pid files are placed, I don't plan to cherry-pick it into the release-3.10 branch.

Comment 6 Atin Mukherjee 2017-09-25 15:17:17 UTC
Ben - do you mind if I close this issue now? As I mentioned in the earlier comment, a stable release branch may not accept this change in behaviour. If you're fine with the workaround, you can stick with the release-3.10 branch; otherwise, please upgrade to release-3.12.

Comment 7 Ben Werthmann 2017-09-25 20:14:39 UTC
I think there should be a minimal fix for 3.10. The minimal fix in this context is:

- glusterd should only kill the pid in the glustershd pid file when that pid belongs to a glusterfs process

I will also run my tests with 3.12 and report results.

Comment 8 lejeczek 2017-09-29 08:42:58 UTC
I'll just chip in my vote - I emailed the mailing list for help too - please fix this in 3.10. It's a freaking frustrating problem.

Comment 9 Joe Julian 2017-09-29 13:37:19 UTC
Killing indiscriminate processes should be considered a major bug and a fix should most definitely be implemented in all supported branches.

Comment 10 Atin Mukherjee 2017-10-09 04:15:30 UTC
Mohit - can you please backport https://review.gluster.org/13580 to release-3.10 branch?

Comment 11 Atin Mukherjee 2017-10-16 11:20:54 UTC
upstream patch : https://review.gluster.org/18025

Comment 12 Atin Mukherjee 2017-10-25 18:43:17 UTC
(In reply to Atin Mukherjee from comment #11)
> upstream patch : https://review.gluster.org/18025

This was a wrong patch link. https://review.gluster.org/18484 is the right one and it got merged in 3.10 branch.

Comment 13 Shyamsundar 2018-06-20 18:30:19 UTC
This bug reported is against a version of Gluster that is no longer maintained
(or has been EOL'd). See https://www.gluster.org/release-schedule/ for the
versions currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of gluster or against the mainline
gluster repository, request that it be reopened and the Version field be marked
appropriately.

Comment 14 Red Hat Bugzilla 2023-09-14 04:07:46 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days