Bug 1262977

Summary: [RFE] verify that systemd config prevents repeatedly restarting daemons
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Samuel Just <sjust>
Component: RADOS
Assignee: Boris Ranto <branto>
Status: CLOSED ERRATA
QA Contact: Rachana Patel <racpatel>
Severity: medium
Docs Contact: Bara Ancincova <bancinco>
Priority: unspecified
Version: 1.2.3
CC: branto, ceph-eng-bugs, dzafman, gfarnum, hnallurv, kchai, kdreyer, nlevine
Target Milestone: rc
Keywords: FutureFeature
Target Release: 2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-10.2.1-12.el7cp.x86_64
Doc Type: Enhancement
Doc Text:
.`systemd` now restarts failed Ceph services
When a Ceph service, such as `ceph-mon` or `ceph-osd`, fails, the `systemd` daemon now attempts to restart the service. Prior to this update, Ceph services remained in the failed state.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-23 19:27:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1322504    

Description Samuel Just 2015-09-14 19:31:47 UTC
Description of problem:

The linked bug describes the situation for Upstart; verify that systemd behaves properly.

Comment 3 Ken Dreyer (Red Hat) 2015-09-14 19:40:54 UTC
systemd support will land in Infernalis/Jewel -> re-targeting

Comment 4 Ken Dreyer (Red Hat) 2016-02-29 19:34:05 UTC
Boris, I think we need to add the following to the systemd unit files:

  Restart=on-failure
  StartLimitBurst=3
  StartLimitInterval=1800

to match Upstart's "respawn" and "respawn limit" settings. Would you please submit a PR for that?
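
For illustration only, these directives could also be tried out locally as a drop-in before the packaged units change; the drop-in path and file name here are hypothetical:

  # /etc/systemd/system/ceph-osd@.service.d/restart.conf -- hypothetical local
  # drop-in; the actual fix puts the directives in the packaged unit files.
  [Service]
  Restart=on-failure
  StartLimitBurst=3
  StartLimitInterval=1800

After creating such a drop-in, `systemctl daemon-reload` picks it up. On the systemd shipped with RHEL 7, StartLimitBurst= and StartLimitInterval= are [Service] directives; with these values a unit is restarted at most 3 times in any 1800-second window before systemd gives up.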

Comment 5 Boris Ranto 2016-03-15 21:17:18 UTC
Ken, we should definitely add the

Restart=on-failure

line if we want systemd to actually attempt to restart the services on failure. However, I'm not sure we want to override the defaults for the restarts -- systemd has its own defaults for how many times a process may fail within a given period before it gives up.
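
For reference, the effective policy for a given unit can be queried directly; the built-in default on systemd of that era is on the order of 5 start attempts per 10 seconds, though that figure is an assumption, not taken from this bug:

  # Inspect the effective restart policy for one OSD unit. Property names
  # vary across systemd versions (newer releases report StartLimitIntervalUSec
  # instead of StartLimitInterval).
  systemctl show ceph-osd@3.service -p Restart -p StartLimitInterval -p StartLimitBurst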

Comment 6 Greg Farnum 2016-03-15 21:42:16 UTC
Those restart limits were chosen reasonably carefully based on characteristics of Ceph OSDs as IO-consuming beasts, of Ceph clusters as a whole, and the interaction between those two. The SystemD default process limits are unlikely to be useful in that regard, and we have those custom limits based on issues customers have run into with different rules. ;)

Comment 7 Boris Ranto 2016-03-17 17:58:03 UTC
OK, that makes sense. The upstream PR:

https://github.com/ceph/ceph/pull/8188
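
Once a build containing the change is installed, the shipped policy can be confirmed without reading the source (assuming the package installs its units under /usr/lib/systemd/system):

  # Check that the packaged unit carries the restart policy.
  grep -E 'Restart|StartLimit' /usr/lib/systemd/system/ceph-osd@.service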

Comment 12 Rachana Patel 2016-06-13 23:19:11 UTC
Verified with version 10.2.1-12.el7cp.x86_64. The OSD was killed four times with SIGKILL; systemd restarted it after each of the first three kills and gave up after the fourth, matching StartLimitBurst=3:

[root@magna084 ubuntu]# ps auxww | grep ceph
ceph       34998  0.0  0.0 336964 24056 ?        Ssl  22:19   0:00 /usr/bin/ceph-mon -f --cluster ceph --id magna084 --setuser ceph --setgroup ceph
ceph       35108  0.2  0.0 877708 25644 ?        Ssl  22:19   0:00 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
root       40539  0.0  0.0 112648   976 pts/1    S+   22:24   0:00 grep --color=auto ceph

[root@magna084 ubuntu]# kill -9 35108

[root@magna084 ubuntu]# ps auxww | grep ceph
ceph       34998  0.0  0.0 339012 25248 ?        Ssl  22:19   0:00 /usr/bin/ceph-mon -f --cluster ceph --id magna084 --setuser ceph --setgroup ceph
ceph       40620  1.7  0.0 862240 28204 ?        Ssl  22:26   0:00 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
root       40786  0.0  0.0 112648   972 pts/1    S+   22:26   0:00 grep --color=auto ceph

[root@magna084 ubuntu]# kill -9 40620

[root@magna084 ubuntu]# ps auxww | grep ceph
ceph       34998  0.0  0.0 339012 25920 ?        Ssl  22:19   0:00 /usr/bin/ceph-mon -f --cluster ceph --id magna084 --setuser ceph --setgroup ceph
ceph       40841  4.6  0.0 862260 25584 ?        Ssl  22:27   0:00 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
root       40983  0.0  0.0 112648   976 pts/1    S+   22:27   0:00 grep --color=auto ceph


[root@magna084 ubuntu]# kill -9 40841
[root@magna084 ubuntu]# ps auxww | grep ceph
ceph       34998  0.0  0.0 340036 25572 ?        Ssl  22:19   0:00 /usr/bin/ceph-mon -f --cluster ceph --id magna084 --setuser ceph --setgroup ceph
ceph       41038  3.3  0.0 869244 20764 ?        Ssl  22:27   0:00 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
root       41184  0.0  0.0 112648   972 pts/1    S+   22:27   0:00 grep --color=auto ceph

[root@magna084 ubuntu]# kill -9 41038
[root@magna084 ubuntu]# ps auxww | grep ceph
ceph       34998  0.0  0.0 340036 26512 ?        Ssl  22:19   0:00 /usr/bin/ceph-mon -f --cluster ceph --id magna084 --setuser ceph --setgroup ceph
root       41193  0.0  0.0 112648   976 pts/1    S+   22:28   0:00 grep --color=auto ceph


Hence moving to VERIFIED.
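
As a repeatable version of the manual check above, a minimal shell sketch (hypothetical helper, assuming OSD id 3 on the local host as in the session):

  # Kill the OSD four times; systemd should restart it after the first three
  # kills (StartLimitBurst=3) and then leave the unit in the failed state.
  for i in 1 2 3 4; do
      pid=$(pgrep -f 'ceph-osd.*--id 3 ') || break
      echo "attempt $i: killing ceph-osd pid $pid"
      kill -9 "$pid"
      sleep 10
  done
  systemctl status ceph-osd@3.service    # expect "failed" after the last kill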

Comment 14 errata-xmlrpc 2016-08-23 19:27:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1755