Bug 1776264 - RFE: systemd should restart glusterd on crash
Summary: RFE: systemd should restart glusterd on crash
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Sanju
QA Contact:
URL:
Whiteboard:
Depends On: 1663557
Blocks:
Reported: 2019-11-25 11:41 UTC by Sanju
Modified: 2020-01-09 12:45 UTC (History)

Fixed In Version:
Clone Of: 1663557
Environment:
Last Closed: 2019-12-05 07:47:41 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Links
System ID | Private | Priority | Status | Summary | Last Updated
Gluster.org Gerrit 23751 | 0 | None | Merged | glusterd: start glusterd automatically on abnormal shutdown | 2019-12-05 07:47:40 UTC

Description Sanju 2019-11-25 11:41:51 UTC
Description of problem:
Currently, systemd is used to manage glusterd, but after the initial start, it does not ensure glusterd continues to run. Within limits, systemd should attempt to restart glusterd if it crashes in order to better handle transient failures.


Version-Release number of selected component (if applicable):
glusterfs-fuse-3.12.2-25.el7rhgs.x86_64
python2-gluster-3.12.2-25.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-25.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-25.el7rhgs.x86_64
glusterfs-cli-3.12.2-25.el7rhgs.x86_64
glusterfs-api-3.12.2-25.el7rhgs.x86_64
glusterfs-3.12.2-25.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-server-3.12.2-25.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
pcp-pmda-gluster-4.3.0-0.201812061439.git24488c63.el7.x86_64
glusterfs-geo-replication-3.12.2-25.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.3.x86_64
glusterfs-rdma-3.12.2-25.el7rhgs.x86_64


How reproducible:
100%... if glusterd crashes, it stays down.


Steps to Reproduce:
1. Encounter glusterd SEGV
2. Observe the lack of restart


Actual results:
Glusterd is not automatically restarted on failure


Expected results:
For occasional crashes, we should use systemd to restart glusterd


Additional info:
This request comes from my experience maintaining openshift.io. We encounter periodic crashes of glusterd, usually due to monitoring operations. To recover automatically from these crashes, I have adjusted the unit file as follows.
In the [Service] section, I have added:

StartLimitBurst=3
StartLimitIntervalSec=3600
StartLimitInterval=3600
Restart=on-abnormal
RestartSec=60

The above causes systemd to automatically restart glusterd if it crashes. It will restart up to 3 times over a 1 hour period. This has the effect of masking the occasional failure, but will leave the daemon down if failures exceed the threshold (at which point other monitoring will raise an alert).
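
One way to try these settings without editing the packaged unit file is a systemd drop-in. The fragment below is only a sketch: it assumes the unit is named glusterd.service, and the drop-in file name restart.conf is hypothetical. The values are the ones listed above. It uses the [Service] spelling StartLimitInterval=, which matches the older systemd shipped with el7. On systemd 230 and later, the start-limit settings are documented in the [Unit] section as StartLimitIntervalSec= and StartLimitBurst=, with the [Service] spellings still accepted for compatibility.

```ini
# /etc/systemd/system/glusterd.service.d/restart.conf
# Hypothetical drop-in path; the file name "restart.conf" is an assumption.
[Service]
# Restart only on abnormal termination (unclean signal such as SIGSEGV,
# timeout, or watchdog), not on a clean exit or a non-zero exit code.
Restart=on-abnormal
# Wait 60 seconds before each restart attempt.
RestartSec=60
# Allow at most 3 start attempts within a 3600-second window.
# Legacy [Service] spelling; newer systemd prefers
# StartLimitIntervalSec= and StartLimitBurst= under [Unit].
StartLimitBurst=3
StartLimitInterval=3600
```

After adding or changing a drop-in, systemctl daemon-reload is needed for systemd to pick it up. Comments are kept on their own lines because systemd unit files do not support trailing comments on a setting line.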

We should consider incorporating the above (or a variant thereof) into the standard distribution.

Comment 1 Worker Ant 2019-11-25 11:59:22 UTC
REVIEW: https://review.gluster.org/23751 (glusterd: start glusterd automatically on abnormal shutdown) posted (#1) for review on master by Sanju Rakonde

Comment 2 Worker Ant 2019-12-05 07:47:41 UTC
REVIEW: https://review.gluster.org/23751 (glusterd: start glusterd automatically on abnormal shutdown) merged (#2) on master by MOHIT AGRAWAL

