Red Hat Bugzilla – Bug 1310111
[RFE] Sat6 services to be configured to restart on failure
Last modified: 2017-11-02 13:22:11 EDT
Description of problem:
Configure Sat6 essential services to restart automatically on failure.
Sat6 relies on various services that are essential to different parts of the product. If such a service fails for whatever reason (say, a segfault), the affected functionality stays disabled until an administrator intervenes. That often happens only at the end of a long sequence: some service failed -> some functionality doesn't work -> customer is not notified / doesn't check logs -> after some time, they realize the functionality does not work -> they raise a support case with Red Hat -> it takes time for us to identify the cause -> service restarted.
Both the functionality downtime and the amount of Red Hat support intervention are ridiculously high.
(A Sat6 health-check script would alleviate this pain to some extent. But even with that, this request remains valid: technically, a health-check script is just a different form of logs and doesn't restart a failed service itself.)
On technical level:
- not sure if this is applicable to RHEL6, where manual changes to each and every init script would be needed. I am OK with doing this for RHEL7 only and updating just the systemd config
- ideally, systemd should be configured to restart any failed/killed/.. service several times in a row and then give up - or optionally retry with some nontrivial delay between the attempts
- essential/critical services: basically those covered by "katello-service status"
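For illustration, the restart behaviour described in the list above could be sketched as a systemd drop-in for a single service. This is a minimal sketch under assumptions, not a tested or shipped configuration: the file name restart.conf and all numeric values are made up for the example, the unit is assumed to be named qdrouterd.service, and the StartLimit* spelling shown is the one accepted in the [Service] section by the systemd version in RHEL7 (newer systemd releases prefer StartLimitIntervalSec= in the [Unit] section).

```ini
# Hypothetical drop-in: /etc/systemd/system/qdrouterd.service.d/restart.conf
# (file name and all values below are illustrative assumptions)
[Service]
# Restart after a non-zero exit, a crash, or an unclean signal (e.g. SIGSEGV, SIGKILL)
Restart=on-failure
# Wait 30 seconds between restart attempts
RestartSec=30
# Allow at most 5 starts within any 10-minute window, then give up
StartLimitInterval=600
StartLimitBurst=5
```

After adding such a file, "systemctl daemon-reload" is needed for systemd to pick it up. Note that Restart=on-failure does not restart a service terminated with SIGTERM (the default signal sent by kill), because systemd treats that as a clean exit; Restart=always would cover that case as well.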
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Mimic a service failure by killing it (an example: kill qdrouterd)
2. Wait some time to allow Sat to self-heal
3. Try the failed functionality (an example: install some errata, which relies on qdrouterd)
Actual results:
Step 3 fails regardless of the delay in step 2.
Expected results:
Step 3 succeeds after some time without any intervention.
Per 6.3 planning, moving non-acked bugs out to the backlog.
It's worth noting, Satellite team itself directly controls very few unit files. Most are shipped with the OS packages (httpd, qpid, tomcat, etc).
Created redmine issue http://projects.theforeman.org/issues/16938 from this bug
(In reply to Stephen Benjamin from comment #4)
> It's worth noting, Satellite team itself directly controls very few unit
> files. Most are shipped with the OS packages (httpd, qpid, tomcat, etc).
.. but installer can configure the other unit files as well.
Restart on failure hides legitimate bugs, and I'm against the idea (in general; I'm sure there's some case where it makes sense). But touching other projects' unit files? Not the installer's responsibility AT ALL, even if it's systemd drop-ins.
Thank you for your interest in Satellite 6. We have evaluated this request, and we do not expect this to be implemented in the product in the foreseeable future. We are therefore closing this out as WONTFIX. If you have any concerns about this, please feel free to contact Rich Jerrido or Bryan Kearney. Thank you.