1522916 – [RFE] Use systemd to monitor vdsm

Summary: [RFE] Use systemd to monitor vdsm

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	vdsm
Classification:	oVirt
Component:	Core
Sub Component:
Version:	4.20.15
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Marcin Sobczyk
QA Contact:	Lukas Svaty
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-12-06 17:50 UTC by Piotr Kliczewski
Modified:	2020-06-26 16:38 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2020-04-01 14:48:05 UTC
oVirt Team:	Infra
Embargoed:
Dependent Products:
Flags:	mperina: ovirt-4.5? rule-engine: planning_ack? rule-engine: devel_ack? rule-engine: testing_ack?

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1525171	0	unspecified	CLOSED	Ensure that only one VDSM instance is running	2021-02-22 00:41:40 UTC

Internal Links: 1525171

Description Piotr Kliczewski 2017-12-06 17:50:46 UTC

We want systemd to monitor vdsm and restart it if it is not responding. We would like to use systemd.daemon.notify to let systemd know that vdsm is up and running.

Comment 1 Nir Soffer 2017-12-07 12:53:50 UTC

Adding more info from the discussion on the vdsm call.

Systemd notify provides two important mechanisms that we would like to use:

- startup completion detection: vdsmd will notify systemd when it has started and
  is ready to accept requests. This will help other services (e.g. mom, hosted
  engine agent) communicate with vdsmd without needing to handle "connection
  refused" errors.
 
- watching vdsmd hangs: vdsmd will notify systemd watchdog periodically. If vdsmd 
  stops notifying the watchdog because of a deadlock or complete process hangup,
  or some other critical error, systemd will restart vdsmd. If vdsmd is blocked
  in D state and cannot be restarted, we will have logs about it in the journal.
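
Both mechanisms use the same sd_notify protocol: the service sends a short
text datagram to the unix socket named in the NOTIFY_SOCKET environment
variable. A minimal sketch of that protocol using only the standard library
follows; the helper name sd_notify and its environ parameter are illustrative
assumptions, not vdsm code.

```python
import os
import socket


def sd_notify(state, environ=os.environ):
    """Send a state string such as "READY=1" or "WATCHDOG=1" to systemd.

    Returns True if the datagram was sent, False if NOTIFY_SOCKET is not
    set (for example, when the process was not started by systemd).
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a socket in the Linux abstract namespace.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), path)
    return True
```

When the process is not running under systemd the helper does nothing, so the
same code path can be used in development and in tests.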

General solution:

1. Use Type=notify in vdsmd.service

2. Notify systemd (READY=1) via systemd.daemon.notify from the python-systemd
   module after vdsmd has started to listen on the vdsmd port.

3. Specify WatchdogSec in vdsmd.service

4. Add a health thread, checking vdsm subsystems periodically. If all subsystems
   are healthy, notify the systemd watchdog using systemd.daemon.notify
   (WATCHDOG=1). If one of the subsystems is considered unhealthy, skip the
   notification, so that systemd restarts vdsm once WatchdogSec expires.
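
The health thread from step 4 could look roughly like the sketch below. The
class name HealthMonitor, the checks mapping and the interval value are
hypothetical; notify is expected to be a callable such as
systemd.daemon.notify. This is one possible shape under those assumptions,
not the vdsm implementation.

```python
import threading


class HealthMonitor(threading.Thread):
    """Notify the systemd watchdog only while every check passes."""

    def __init__(self, checks, notify, interval):
        super().__init__(name="health", daemon=True)
        self._checks = checks      # hypothetical: name -> callable returning bool
        self._notify = notify      # e.g. systemd.daemon.notify
        self._interval = interval  # seconds, well below WatchdogSec
        self._stop_event = threading.Event()

    def unhealthy(self):
        """Return the names of the subsystems whose check failed."""
        bad = []
        for name, check in self._checks.items():
            try:
                ok = check()
            except Exception:
                # A check that raises counts as a failed check.
                ok = False
            if not ok:
                bad.append(name)
        return bad

    def run(self):
        while not self._stop_event.is_set():
            # Staying silent lets WatchdogSec expire, after which systemd
            # treats the service as failed.
            if not self.unhealthy():
                self._notify("WATCHDOG=1")
            self._stop_event.wait(self._interval)

    def stop(self):
        self._stop_event.set()
```

Usage would be along the lines of
HealthMonitor({"storage": check_storage}, notify, 10).start(), where
check_storage is a placeholder for a real subsystem check. Note that systemd
restarts the service after a watchdog timeout only if the unit's Restart=
setting covers that case (for example on-watchdog, on-failure or always).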

Related docs:
- https://www.freedesktop.org/software/systemd/man/systemd.service.html#Type=
- https://www.freedesktop.org/software/systemd/man/systemd.service.html#WatchdogSec=
- http://man7.org/linux/man-pages/man3/sd_notify.3.html

Comment 2 Germano Veit Michel 2019-01-30 06:24:38 UTC

Please also consider watching supervdsmd as suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1666123#c23.

And maybe other host daemons like ovirt-ha.

Comment 3 Michal Skrivanek 2020-03-19 15:41:55 UTC

We didn't get to this bug for more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed, so I suggest closing it.
If you feel this needs to be addressed and want to work on it, please remove cond nack and target accordingly.

Comment 4 Michal Skrivanek 2020-04-01 14:48:05 UTC

Closing old bug. Please reopen if still relevant/you want to work on it.
