Bug 1522916 - [RFE] Use systemd to monitor vdsm
Summary: [RFE] Use systemd to monitor vdsm
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.15
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
: ---
Assignee: Marcin Sobczyk
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-12-06 17:50 UTC by Piotr Kliczewski
Modified: 2020-06-26 16:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-01 14:48:05 UTC
oVirt Team: Infra
Embargoed:
mperina: ovirt-4.5?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1525171 | 0 | unspecified | CLOSED | Ensure that only one VDSM instance is running | 2021-02-22 00:41:40 UTC

Internal Links: 1525171

Description Piotr Kliczewski 2017-12-06 17:50:46 UTC
We want systemd to monitor vdsm and restart it if it is not responding. We would like to use systemd.daemon.notify to let systemd know that vdsm is up and running.

Comment 1 Nir Soffer 2017-12-07 12:53:50 UTC
Adding more info from the discussion on the vdsm call.

Systemd notify provides two important mechanisms that we would like to use:

- startup completion detection: vdsmd will notify systemd when it has started and
  is ready to accept requests. This will help other services (e.g. mom, hosted
  engine agent) to communicate with vdsmd without having to handle "connection
  refused" errors.
 
- hang detection: vdsmd will notify the systemd watchdog periodically. If vdsmd
  stops notifying the watchdog because of a deadlock, a complete process hang,
  or some other critical error, systemd will restart vdsmd. If vdsmd is blocked
  in D state and cannot be restarted, we will have logs about it in the journal.
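
For illustration, both mechanisms rely on the same notification protocol: the service sends a short datagram such as READY=1 or WATCHDOG=1 to the unix socket named in the NOTIFY_SOCKET environment variable. The sketch below shows that protocol in plain Python. The function name sd_notify is chosen here for the example and is not vdsm code; a real implementation would more likely call the python-systemd bindings.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string (e.g. "READY=1") to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was not
    started by systemd with notification enabled.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket, which is
    # addressed with a leading NUL byte instead.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), path)
    return True
```

Under this sketch, vdsmd would call sd_notify("READY=1") once its listener is accepting connections, and sd_notify("WATCHDOG=1") from then on at an interval shorter than the configured watchdog timeout.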

General solution:

1. Use Type=notify in vdsmd.service (READY=1)

2. Notify systemd via systemd.notify python module after vdsmd has started to 
   listen on the vdsmd port.

3. Specify WatchdogSec in vdsmd.service

4. Add a health thread, checking vdsm subsystems periodically. If all subsystems
   are healthy, notify the systemd watchdog using the systemd.notify python
   module (WATCHDOG=1). If any subsystem is considered unhealthy, skip the
   notification, so that the watchdog times out and systemd restarts vdsm.
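
A minimal sketch of the health thread from step 4 could look like the following. It is an illustration under assumptions, not the vdsm implementation: the class name HealthWatchdog, the checks list, and the injected notify callable are all hypothetical. In a real service the notify argument would presumably be systemd.daemon.notify from the python-systemd bindings, and the interval would be kept well below WatchdogSec (the systemd documentation recommends roughly half of it).

```python
import threading
from typing import Callable, Iterable


class HealthWatchdog:
    """Periodically run health checks and ping the watchdog if all pass."""

    def __init__(self, checks: Iterable[Callable[[], bool]],
                 notify: Callable[[str], object], interval: float):
        self._checks = list(checks)   # hypothetical per-subsystem checks
        self._notify = notify         # e.g. systemd.daemon.notify
        self._interval = interval     # seconds; keep below WatchdogSec
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="health-watchdog", daemon=True)

    @staticmethod
    def _passes(check: Callable[[], bool]) -> bool:
        # A check that raises is treated as unhealthy.
        try:
            return bool(check())
        except Exception:
            return False

    def check_once(self) -> bool:
        """Run all checks; send WATCHDOG=1 only when every check passes."""
        healthy = all(self._passes(check) for check in self._checks)
        if healthy:
            self._notify("WATCHDOG=1")
        return healthy

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval):
            self.check_once()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()
```

Passing the notify function in as a parameter keeps the health logic testable without systemd. Note that a deadlock in the health thread itself also stops the notifications, which is the desired outcome here.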

Related docs:
- https://www.freedesktop.org/software/systemd/man/systemd.service.html#Type=
- https://www.freedesktop.org/software/systemd/man/systemd.service.html#WatchdogSec=
- http://man7.org/linux/man-pages/man3/sd_notify.3.html

Comment 2 Germano Veit Michel 2019-01-30 06:24:38 UTC
Please also consider watching supervdsmd as suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1666123#c23.

And maybe other host daemons like ovirt-ha.

Comment 3 Michal Skrivanek 2020-03-19 15:41:55 UTC
We didn't get to this bug for more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed so I'm suggesting to close it.
If you feel this needs to be addressed and want to work on it please remove cond nack and target accordingly.

Comment 4 Michal Skrivanek 2020-04-01 14:48:05 UTC
Closing old bug. Please reopen if still relevant/you want to work on it.

