Bug 2079901 - allow Enroll Certificate action when host is Non Responsive
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.0-1
Target Release: 4.5.0.7
Assignee: Milan Zamazal
QA Contact: Petr Kubica
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-04-28 13:05 UTC by Michal Skrivanek
Modified: 2022-05-30 06:42 UTC

Fixed In Version: ovirt-engine-4.5.0.7
Clone Of:
Environment:
Last Closed: 2022-05-30 06:42:37 UTC
oVirt Team: Infra
Embargoed:
pm-rhel: ovirt-4.5?
lsvaty: exception+



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | oVirt ovirt-engine pull 324 | 0 | None | open | Improve certificate renewal | 2022-05-02 17:30:42 UTC
Github | oVirt ovirt-engine pull 347 | 0 | None | open | Backport CA and certificate fixes to 4.5.0.z | 2022-05-09 13:41:28 UTC
Red Hat Issue Tracker | RHV-45883 | 0 | None | None | None | 2022-04-28 14:10:58 UTC

Description Michal Skrivanek 2022-04-28 13:05:27 UTC
When certificates expire, there is no way to re-enroll them without shutting down all running VMs. Enroll is possible only in the Maintenance state, and the only way to get from Non Responsive to Maintenance with running VMs is through "confirm host has been rebooted", i.e. either reboot the host or knowingly override the actual state, which triggers a restart of HA VMs elsewhere and can cause split brain.

We cannot fully re-enroll all certificates on a running system, but we can install new certificates and restart the subset of services that allows VMs to be migrated away, so that the host can then be moved to proper Maintenance.

Let's
- allow the Enroll Certificate operation when host is in Non Responsive state
- run the same enrollment code as in Maintenance
- restart/reload libvirt (which in turn restarts vdsm), imageio, and OVN
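
The three points above can be sketched roughly as follows. This is only an illustration in Python, not the ovirt-engine code: the names HostStatus, ENROLL_ALLOWED, can_enroll_certificate and plan_enrollment are hypothetical, and the step strings simply echo the wording of this report.

```python
from enum import Enum


class HostStatus(Enum):
    # Hypothetical subset of host states, named after the states in this report.
    UP = "Up"
    MAINTENANCE = "Maintenance"
    NON_RESPONSIVE = "Non Responsive"


# Assumption: Enroll Certificate was previously allowed only in Maintenance;
# NON_RESPONSIVE is the state this report proposes to add.
ENROLL_ALLOWED = frozenset({HostStatus.MAINTENANCE, HostStatus.NON_RESPONSIVE})

# Services the report proposes to restart/reload after enrolling on a
# Non Responsive host (restarting libvirt in turn restarts vdsm).
SERVICES_TO_RESTART = ("libvirt", "imageio", "OVN")


def can_enroll_certificate(status: HostStatus) -> bool:
    """Return True if Enroll Certificate would be offered in this state."""
    return status in ENROLL_ALLOWED


def plan_enrollment(status: HostStatus) -> list[str]:
    """List the steps for an enrollment started in the given state."""
    if not can_enroll_certificate(status):
        raise ValueError(f"Enroll Certificate is not allowed in state {status.value}")
    # The same enrollment code runs in both allowed states.
    steps = ["run enrollment code"]
    if status is HostStatus.NON_RESPONSIVE:
        # Extra step only for the emergency path described in this report.
        steps += [f"restart/reload {service}" for service in SERVICES_TO_RESTART]
    return steps
```

With this sketch, plan_enrollment(HostStatus.NON_RESPONSIVE) lists the enrollment step followed by the three service restarts, while a host in the Up state is still rejected.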

This introduces a small risk of losing track of ongoing actions (e.g. live merge completion), but we should mostly deal with those gracefully. Not all operations would work (e.g. the VNC certificate cannot be reloaded, so console connections will not be possible), but being able to control the host and the lifecycle of its VMs takes priority.

Still, it shall be documented that this is a "desperate measure" for cases where certificates suddenly expire and there's no other way out. Such an action must be followed by another re-enroll in Maintenance, and the VMs that were running at the time *must* be restarted before they are fully functional again.

Comment 1 Petr Kubica 2022-05-24 06:40:43 UTC
Verified in ovirt-engine-4.5.0.7-0.9.el8ev.noarch

Verification steps:
The host was put into a NonResponsive state by killing the vdsmd service (with and without running VMs on the host).
Certificate enrollment was then triggered on the host and completed successfully.

