Description of problem:

The self-heal daemon pid file leaves a stale pid behind, and glusterd
indiscriminately kills whatever pid it finds there when it starts. The pid
files are stored in /var/lib/glusterd, which persists across reboots. When
glusterd is started (or restarted, or the host is rebooted), any process
whose pid matches the one in the shd pid file is killed.

Version-Release number of selected component (if applicable):
3.10.4 from ppa:gluster/glusterfs-3.10

How reproducible:
1 to 1

Steps to Reproduce:
1. Create a volume.
2. Enable the self-heal daemon.
3. Check the pid files:

    find /var/lib/glusterd/ -name '*pid' -exec tail -v {} \;
    ==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
    11642
    ==> /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid <==
    11169

4. Kill all glusterfs processes:

    killall -w glusterfs

5. Create a process, background it, and record its pid:

    sleep infinity & pid=$!
    [1] 11669

6. Put the pid of that process into the shd pid file:

    echo $pid > /var/lib/glusterd/glustershd/run/glustershd.pid

7. Confirm the above:

    find /var/lib/glusterd/ -name '*pid' -exec tail -v {} \;
    ==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
    11669
    ==> /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid <==
    11169

8. Restart glusterfs-server:

    service glusterfs-server restart
    glusterfs-server stop/waiting
    glusterfs-server start/running, process 11687

9. The shell notifies that the unrelated background process was terminated:

    [1]+  Terminated              sleep infinity

10. shd starts, but it has killed a process other than glusterfs:

    gluster v status
    Status of volume: vol0
    Gluster process                          TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick 172.28.128.5:/data/brick0          49152     0          Y       11169
    Brick 172.28.128.6:/data/brick0          49152     0          Y       11023
    Self-heal Daemon on localhost            N/A       N/A        Y       12023
    Self-heal Daemon on 172.28.128.6         N/A       N/A        Y       11044

    Task Status of Volume vol0
    ------------------------------------------------------------------------------
    There are no active volume tasks

Note: In some cases shd fails to start.

Note 2: In one case I saw the same pid listed for both the brick and shd. In
that case the brick process was terminated when shd started.

    find /var/lib/glusterd/ -name '*pid' -exec tail -v {} \;
    ==> /var/lib/glusterd/vols/apcfs-default/run/172.27.0.19-data-brick0.pid <==
    1468
    ==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
    1468

Actual results:
1. The pid file /var/lib/glusterd/glustershd/run/glustershd.pid remains after shd is stopped.
2. glusterd kills whatever process number is in the stale pid file.

Expected results:
1. The shd pid file should be cleaned up.
2. glusterd should only kill instances of the glusterfs process.

Additional info:
OS is Ubuntu Trusty.

Workaround: In our automation, when we stop all gluster processes (reboot,
upgrade, etc.) we ensure all processes are stopped and then clean up the pid
files with 'find /var/lib/glusterd/ -name '*pid' -delete'.
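For reference, a minimal sketch of that cleanup step, assuming the same pid
file layout as above; the surrounding stop logic is simplified and is not our
exact automation:

    #!/bin/sh
    # Sketch of the pre-start cleanup described in the workaround above.
    # Assumes it runs after the gluster services have been told to stop.

    # Wait until no glusterfs processes remain (as in step 4 above).
    killall -w glusterfs 2>/dev/null || true

    # Delete the stale pid files so a later glusterd start cannot kill
    # an unrelated process that happens to have reused one of the pids.
    find /var/lib/glusterd/ -name '*pid' -delete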
Not sure if these patches will help, but it looks like there may be a fix
for this already:

https://review.gluster.org/#/c/13580/
https://review.gluster.org/#/c/17601

Specifically for the "glusterd kills any process number in the stale pid
file" behavior.
May also lead to situations like this:

    $ gluster vol heal $vol statistics
    Gathering crawl statistics on volume $vol has been unsuccessful on
    bricks that are down. Please check if all brick processes are running.

or

    $ gluster v heal testvol statistics
    Gathering crawl statistics on volume testvol has been unsuccessful:
    Staging failed on vm1. Error: Self-heal daemon is not running.
    Check self-heal daemon log file.
Also occurs with 3.10.5 from ppa:gluster/glusterfs-3.10
Upgrading severity to urgent, as this affects the stability of gluster in general.
commit 220d406ad13d840e950eef001a2b36f87570058d
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Wed Mar 2 17:42:07 2016 +0530

    glusterd: Gluster should keep PID file in correct location

    Currently Gluster keeps the process pid information of all the daemons
    and brick processes in the Gluster configuration file directory
    (i.e., /var/lib/glusterd/*). These pid files should be separate from
    the configuration files. Deletion of the configuration file directory
    might result in serious problems. Also, /var/run/gluster is the
    default placeholder directory for pid files. So, with this fix Gluster
    will keep the pid information of all processes in the
    /var/run/gluster/* directory.

    Change-Id: Idb09e3fccb6a7355fbac1df31082637c8d7ab5b4
    BUG: 1258561
    Signed-off-by: Gaurav Kumar Garg <ggarg>
    Signed-off-by: Saravanakumar Arumugam <sarumuga>
    Reviewed-on: https://review.gluster.org/13580
    Tested-by: MOHIT AGRAWAL <moagrawa>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>

The above commit takes care of this issue. Please note this fix is available
in the release-3.12 branch. Since this is a major change in the way pid files
are placed, I don't have a plan to cherry-pick it into the release-3.10
branch.
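As a rough illustration of what the fix changes, after upgrading to a
release containing this commit the shd pid file should appear under
/var/run/gluster rather than /var/lib/glusterd; the exact filenames below
are assumptions and may differ by version:

    # On a release-3.12 (or later) node; paths are illustrative.
    find /var/run/gluster/ -name '*pid' -exec tail -v {} \;

    # /var/run is a tmpfs on most distributions and is cleared on reboot,
    # so stale pids no longer survive a host restart.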
Ben - Do you mind if I close this issue now? As I mentioned in the earlier
comment, a stable release branch may not accept this change in behaviour. So
if you're fine with the workaround, you can choose to stick with the
release-3.10 branch; otherwise, please upgrade to release-3.12.
I think there should be a minimal fix for 3.10. The minimal fix in this
context is:

- glusterd should only kill the pid in the glustershd pid file when that pid
  belongs to a glusterfs process (see the sketch below)

I will also run my tests with 3.12 and report the results.
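A minimal sketch of the kind of check meant here, written in shell for
illustration only: the pid file path is taken from the report above, and
checking /proc/<pid>/comm is an assumption about how one might verify
ownership, not the actual glusterd implementation (which does this in C):

    #!/bin/sh
    # Hypothetical illustration: only signal the recorded pid if it is
    # actually a glusterfs process. Not the real glusterd code path.
    PIDFILE=/var/lib/glusterd/glustershd/run/glustershd.pid

    if [ -r "$PIDFILE" ]; then
        pid=$(cat "$PIDFILE")
        # /proc/<pid>/comm holds the executable name of the process.
        if [ "$(cat "/proc/$pid/comm" 2>/dev/null)" = "glusterfs" ]; then
            kill "$pid"
        else
            # Stale or reused pid: remove the file instead of killing
            # an unrelated process.
            rm -f "$PIDFILE"
        fi
    fi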
I'll just chip in with my vote - I emailed the mailing list for help too -
please fix this in 3.10. It's a freaking frustrating problem.
Indiscriminately killing processes should be considered a major bug, and a
fix should most definitely be implemented in all supported branches.
Mohit - can you please backport https://review.gluster.org/13580 to release-3.10 branch?
upstream patch : https://review.gluster.org/18025
(In reply to Atin Mukherjee from comment #11)
> upstream patch : https://review.gluster.org/18025

That was the wrong patch link. https://review.gluster.org/18484 is the right
one, and it has been merged in the 3.10 branch.
This bug is reported against a version of Gluster that is no longer
maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/
for the versions currently maintained. As a result, this bug is being closed.
If the bug persists on a maintained version of gluster or against the
mainline gluster repository, please request that it be reopened and mark the
Version field appropriately.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days