Bug 1491059 - PID File handling: brick pid file leaves stale pid and brick fails to start when glusterd is started
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact:
URL:
Whiteboard:
Depends On: 1258561 1464072
Blocks:
Reported: 2017-09-12 23:31 UTC by Ben Werthmann
Modified: 2017-11-01 12:58 UTC

Fixed In Version: glusterfs-3.10.7
Clone Of:
Environment:
Last Closed: 2017-11-01 12:58:54 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1464072 0 high CLOSED cns-brick-multiplexing: brick process fails to restart after gluster pod failure 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1491060 0 unspecified CLOSED PID File handling: self-heal-deamon pid file leaves stale pid and indiscriminately kills pid when glusterd is started 2023-09-14 04:07:45 UTC

Internal Links: 1491060

Description Ben Werthmann 2017-09-12 23:31:57 UTC
Description of problem:

The brick pid file leaves a stale pid behind, and the brick fails to start when glusterd is started. Pid files are stored in `/var/lib/glusterd`, which persists across reboots. When glusterd is started (or restarted, or the host is rebooted) and the pid of any running process matches the pid recorded in the brick pid file, the brick fails to start.

Version-Release number of selected component (if applicable):


3.10.4 from ppa:gluster/glusterfs-3.10

How reproducible:

Always (1 out of 1 attempts)

Steps to Reproduce:
1. Create a volume. 
2. Enable Self-Heal Daemon
3. pid status
==> /var/lib/glusterd/glustershd/run/glustershd.pid <==
1398
==> /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid <==
1407
4. killall -w glusterfsd
5. sleep infinity & pid=$!
6. echo $pid >/var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid
7. service glusterfs-server restart
glusterfs-server stop/waiting
glusterfs-server start/running, process 1548
8. gluster v status
Status of volume: vol0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 172.28.128.5:/data/brick0             N/A       N/A        N       N/A  
Brick 172.28.128.6:/data/brick0             49152     0          Y       11023
Self-heal Daemon on localhost               N/A       N/A        Y       1684 
Self-heal Daemon on 172.28.128.6            N/A       N/A        Y       11044
 
Task Status of Volume vol0
------------------------------------------------------------------------------
There are no active volume tasks

Workaround:
9. rm /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid
10. service glusterfs-server restart
glusterfs-server stop/waiting
glusterfs-server start/running, process 1743
11. gluster v status
Status of volume: vol0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 172.28.128.5:/data/brick0             49152     0          Y       1888 
Brick 172.28.128.6:/data/brick0             49152     0          Y       11023
Self-heal Daemon on localhost               N/A       N/A        Y       1879 
Self-heal Daemon on 172.28.128.6            N/A       N/A        Y       11044
 
Task Status of Volume vol0
------------------------------------------------------------------------------
There are no active volume tasks
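
The Online column in the two status listings above can also be checked mechanically. The following is a minimal sketch, assuming the column layout shown in this report (brick lines start with "Brick" and the second-to-last field is the Online flag); `offline_bricks` is a hypothetical helper name, not a gluster command, and long brick paths that wrap onto a second line are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of GlusterFS): read `gluster v status` text on
# stdin and print every brick whose Online column is "N".
# Assumes the six-column brick layout shown in this report:
#   Brick <host:path> <TCP Port> <RDMA Port> <Online> <Pid>
offline_bricks() {
    awk '$1 == "Brick" && NF >= 6 && $(NF-1) == "N" { print $2 }'
}
```

For example, `gluster v status vol0 | offline_bricks` would list `172.28.128.5:/data/brick0` for the first listing and nothing for the second.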


Actual results:
1. brick pid file(s) remain after brick is stopped
2. glusterd fails to start brick when the pid in the pid file matches any process

Expected results:
1. brick pid file(s) should be cleaned up when the brick is stopped gracefully
2. glusterd should start the brick when the process in the pid file is not a glusterfsd process
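
The second expectation amounts to checking the name of the process behind the recorded pid, not only whether the pid exists. The following is a minimal shell sketch of that check, assuming a Linux `/proc` that exposes `/proc/<pid>/comm`; `pidfile_is_live` and the `PROC_ROOT` override are hypothetical names introduced for illustration, and this is not how glusterd itself implements pid file handling.

```shell
#!/bin/sh
# Hypothetical helper (not part of GlusterFS): succeed only if the pid stored
# in a pid file belongs to a running process with the expected command name.
# PROC_ROOT is an assumption added so the check can be tested against a fake
# proc tree; it defaults to /proc.
pidfile_is_live() {
    pidfile=$1
    expected=${2:-glusterfsd}
    proc_root=${PROC_ROOT:-/proc}

    # A missing or unreadable pid file means there is nothing to trust.
    [ -r "$pidfile" ] || return 1

    # Take the first line and strip whitespace; reject anything non-numeric.
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac

    # No comm file means no such process: the pid file is stale.
    [ -r "$proc_root/$pid/comm" ] || return 1

    # The pid exists; compare its command name with the expected one.
    [ "$(cat "$proc_root/$pid/comm")" = "$expected" ]
}
```

With the reproduction steps above, `pidfile_is_live /var/lib/glusterd/vols/vol0/run/172.28.128.5-data-brick0.pid` would fail because the recorded pid belongs to `sleep`, which is the case where the brick should still be started.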

Additional info:
OS is Ubuntu Trusty

Workaround:

In our automation, whenever we stop all gluster processes (reboot, upgrade, etc.), we first ensure that every process has stopped and then clean up the pid files with `find /var/lib/glusterd/ -name '*pid' -delete`.
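
The `find ... -delete` cleanup removes every pid file unconditionally, so it is only safe once all gluster processes are known to be stopped. A more conservative variant is sketched below, assuming a Linux `/proc` with `/proc/<pid>/comm` and assuming that live gluster processes report one of the command names `glusterfsd`, `glusterfs` or `glusterd`. `remove_stale_pidfiles` and `PROC_ROOT` are hypothetical names, and the sketch matches `*.pid` where the command above matches `*pid`.

```shell
#!/bin/sh
# Hypothetical helper (not part of GlusterFS): delete pid files under a
# directory unless the recorded pid belongs to a running gluster process.
# PROC_ROOT is an assumption added for testing; it defaults to /proc.
remove_stale_pidfiles() {
    root=${1:-/var/lib/glusterd}
    proc_root=${PROC_ROOT:-/proc}

    find "$root" -type f -name '*.pid' | while IFS= read -r f; do
        pid=$(head -n 1 "$f" | tr -d '[:space:]')
        comm=
        case $pid in
            ''|*[!0-9]*) ;;   # empty or non-numeric content: treat as stale
            *) [ -r "$proc_root/$pid/comm" ] && comm=$(cat "$proc_root/$pid/comm") ;;
        esac
        case $comm in
            glusterfsd|glusterfs|glusterd) ;;   # live gluster process: keep
            *) rm -f -- "$f" && printf 'removed %s\n' "$f" ;;
        esac
    done
}
```

Called as `remove_stale_pidfiles /var/lib/glusterd` before restarting glusterd, this would drop a pid file whose pid has been reused by an unrelated process while leaving the pid files of running bricks in place.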

Comment 1 Ben Werthmann 2017-09-12 23:41:19 UTC
Looks like there may be a fix for this already:
https://review.gluster.org/#/c/13580/
https://review.gluster.org/#/c/17601

Comment 2 Ben Werthmann 2017-09-13 19:57:46 UTC
May also lead to situations like this:

$ gluster vol heal $vol statistics
Gathering crawl statistics on volume $vol has been unsuccessful on bricks that are down. Please check if all brick processes are running.

or

gluster v heal testvol statistics
Gathering crawl statistics on volume testvol has been unsuccessful:
  Staging failed on vm1. Error: Self-heal daemon is not running. Check self-heal daemon log file.

Comment 3 Ben Werthmann 2017-09-13 20:25:56 UTC
Also occurs with 3.10.5 from ppa:gluster/glusterfs-3.10

Comment 4 Ben Werthmann 2017-09-13 20:27:24 UTC
Upgrading to urgent as this affects stability of gluster in general.

Comment 5 Atin Mukherjee 2017-09-18 07:39:49 UTC
commit 220d406ad13d840e950eef001a2b36f87570058d
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Wed Mar 2 17:42:07 2016 +0530

    glusterd: Gluster should keep PID file in correct location
    
    Currently Gluster keeps process pid information of all the daemons
    and brick processes in Gluster configuration file directory
    (ie., /var/lib/glusterd/*).
    
    These pid files should be seperate from configuration files.
    Deletion of the configuration file directory might result into serious problems.
    Also, /var/run/gluster is the default placeholder directory for pid files.
    
    So, with this fix Gluster will keep all process pid information of all
    processes in /var/run/gluster/* directory.
    
    Change-Id: Idb09e3fccb6a7355fbac1df31082637c8d7ab5b4
    BUG: 1258561
    Signed-off-by: Gaurav Kumar Garg <ggarg>
    Signed-off-by: Saravanakumar Arumugam <sarumuga>
    Reviewed-on: https://review.gluster.org/13580
    Tested-by: MOHIT AGRAWAL <moagrawa>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>

The above commit takes care of this issue. Please note that this fix is available in the release-3.12 branch. Since it is a major change in where pid files are placed, I don't plan to cherry-pick it into the release-3.10 branch.

Comment 6 Atin Mukherjee 2017-09-25 15:17:51 UTC
Ben - Do you mind if I close this issue now? As I mentioned in the earlier comment, a stable release branch may not accept this change in behaviour. If you're fine with the workaround, you can stick with the release-3.10 branch; otherwise, please upgrade to release-3.12.

Comment 7 Ben Werthmann 2017-09-25 20:11:56 UTC
I think there should be a minimal fix for 3.10. The minimal fix in this context is:

- glusterd should start the brick when the process in the pid file is not a glusterfsd process

I will also run my tests with 3.12 and report results.

Comment 8 Atin Mukherjee 2017-10-09 04:15:22 UTC
Mohit - can you please backport https://review.gluster.org/13580 to release-3.10 branch?

Comment 9 Worker Ant 2017-10-11 05:52:14 UTC
REVIEW: https://review.gluster.org/18484 (glusterd: Gluster should keep PID file in correct location) posted (#1) for review on release-3.10 by MOHIT AGRAWAL (moagrawa)

Comment 10 Worker Ant 2017-10-25 14:03:58 UTC
COMMIT: https://review.gluster.org/18484 committed in release-3.10 by Shyamsundar Ranganathan (srangana) 
------
commit 411a401f7e4f81f6a77eea1438a3a43c73e06104
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Wed Mar 2 17:42:07 2016 +0530

    glusterd: Gluster should keep PID file in correct location
    
    Currently Gluster keeps process pid information of all the daemons
    and brick processes in Gluster configuration file directory
    (ie., /var/lib/glusterd/*).
    
    These pid files should be seperate from configuration files.
    Deletion of the configuration file directory might result into serious problems.
    Also, /var/run/gluster is the default placeholder directory for pid files.
    
    So, with this fix Gluster will keep all process pid information of all
    processes in /var/run/gluster/* directory.
    
    > Change-Id: Idb09e3fccb6a7355fbac1df31082637c8d7ab5b4
    > BUG: 1258561
    > Signed-off-by: Gaurav Kumar Garg <ggarg>
    > Signed-off-by: Saravanakumar Arumugam <sarumuga>
    > Reviewed-on: https://review.gluster.org/13580
    > Tested-by: MOHIT AGRAWAL <moagrawa>
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Atin Mukherjee <amukherj>
    > (Cherry pick from commit 220d406ad13d840e950eef001a2b36f87570058d)
    
    BUG: 1491059
    Change-Id: Idb09e3fccb6a7355fbac1df31082637c8d7ab5b4
    Signed-off-by: Mohit Agrawal <moagrawa>

Comment 11 Shyamsundar 2017-11-01 12:58:54 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.7, please open a new bug report.

glusterfs-3.10.7 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-November/000085.html
[2] https://www.gluster.org/pipermail/gluster-users/

