Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1522916

Summary: [RFE] Use systemd to monitor vdsm
Product: [oVirt] vdsm
Reporter: Piotr Kliczewski <pkliczew>
Component: Core
Assignee: Marcin Sobczyk <msobczyk>
Status: CLOSED DEFERRED
QA Contact: Lukas Svaty <lsvaty>
Severity: high
Docs Contact:
Priority: medium
Version: 4.20.15
CC: bugs, gveitmic, mgoldboi, mperina, msobczyk, nsoffer, pkliczew
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Flags: mperina: ovirt-4.5?
       rule-engine: planning_ack?
       rule-engine: devel_ack?
       rule-engine: testing_ack?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-04-01 14:48:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Piotr Kliczewski 2017-12-06 17:50:46 UTC
We want systemd to monitor vdsm and restart it if it is not responding. We would like to use systemd.daemon.notify to let systemd know that vdsm is up and running.

Comment 1 Nir Soffer 2017-12-07 12:53:50 UTC
Adding more info from the discussion on vdsm call.

Systemd notify provides two important mechanisms that we would like to use:

- startup completion detection: vdsmd will notify systemd when it has started and
  is ready to accept requests. This will help other services (e.g. mom, hosted
  engine agent) to communicate with vdsmd without needing to handle
  "connection refused" errors.
 
- watching vdsmd hangs: vdsmd will notify systemd watchdog periodically. If vdsmd 
  stops notifying the watchdog because of a deadlock or complete process hangup,
  or some other critical error, systemd will restart vdsmd. If vdsmd is blocked
  in D state and cannot be restarted, we will have logs about it in the journal.
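
A minimal sketch of the startup notification, for illustration only. It
implements the sd_notify datagram protocol directly with the Python standard
library instead of the systemd Python bindings, so it is not vdsm code. The
names sd_notify and start_listener, and the host and port defaults, are
assumptions made for this example.

```python
import os
import socket


def sd_notify(state):
    """Send a state string such as "READY=1" to the service manager.

    Returns False when NOTIFY_SOCKET is not set, which is the case when the
    process was not started by systemd with Type=notify.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    addr = os.fsencode(path)
    if addr.startswith(b"@"):
        # A leading "@" denotes a Linux abstract namespace socket, whose
        # real address starts with a NUL byte.
        addr = b"\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), addr)
    return True


def start_listener(host="127.0.0.1", port=0):
    """Hypothetical startup path: listen first, report readiness second."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen()
    # Only now can clients connect without getting "connection refused".
    sd_notify("READY=1")
    return server
```

With Type=notify, systemd treats the unit as started only after it receives
READY=1, so units ordered after vdsmd would start once the port is open.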

General solution:

1. Use Type=notify in vdsmd.service (READY=1)

2. Notify systemd via systemd.notify python module after vdsmd has started to 
   listen on the vdsmd port.

3. Specify WatchdogSec in vdsmd.service

4. Add a health thread that checks vdsm subsystems periodically. If all
   subsystems are healthy, notify the systemd watchdog using the systemd.notify
   Python module (WATCHDOG=1). If one of the subsystems is considered unhealthy,
   skip the notification so that systemd restarts vdsm.
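
The health thread from step 4 could look roughly like the sketch below. This
is an illustration under assumptions, not vdsm code: the HealthMonitor class,
its parameters, and the idea of passing subsystem checks as callables are all
hypothetical. The notify argument is any callable that accepts a state string,
for example systemd.daemon.notify from the python-systemd package.

```python
import threading


class HealthMonitor:
    """Pings the watchdog only while every subsystem check passes."""

    def __init__(self, notify, checks, interval):
        self._notify = notify        # callable taking a state string
        self._checks = checks        # callables returning True when healthy
        self._interval = interval    # seconds; keep well below WatchdogSec
        self._done = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="health", daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._done.set()
        self._thread.join(timeout=5)

    def healthy(self):
        for check in self._checks:
            try:
                if not check():
                    return False
            except Exception:
                # A check that raises is treated as an unhealthy subsystem.
                return False
        return True

    def _run(self):
        while not self._done.is_set():
            if self.healthy():
                self._notify("WATCHDOG=1")
            # When unhealthy, stay silent and let WatchdogSec expire.
            self._done.wait(self._interval)
```

A check that blocks forever also stops the notifications, because the loop
never reaches the notify call. That matches the deadlock case described
above.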

Related docs:
- https://www.freedesktop.org/software/systemd/man/systemd.service.html#Type=
- https://www.freedesktop.org/software/systemd/man/systemd.service.html#WatchdogSec=
- http://man7.org/linux/man-pages/man3/sd_notify.3.html

Comment 2 Germano Veit Michel 2019-01-30 06:24:38 UTC
Please also consider watching supervdsmd as suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1666123#c23.

And maybe other host daemons like ovirt-ha.

Comment 3 Michal Skrivanek 2020-03-19 15:41:55 UTC
We didn't get to this bug for more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed so I'm suggesting to close it.
If you feel this needs to be addressed and want to work on it please remove cond nack and target accordingly.

Comment 4 Michal Skrivanek 2020-04-01 14:48:05 UTC
Closing old bug. Please reopen if still relevant/you want to work on it.