Description of problem:
I tested bug #2079901, where users should be able to re-enroll certificates once they have expired. I think the main idea of that bug is to be able to recover a host with running VMs back to Up after its certificates have expired.

The problem is that the host remains in state Connecting and its status never goes to Non-Responsive (where it is possible to re-enroll certificates).

My wild guess is that the affected host is still up with an open socket, so the engine is able to connect to the host and then fails on the handshake.

When that host also serves as the SPM, the environment is stuck (domains are down and it is not possible to run any new VM; running VMs remain up).

I guess there could be two possible workarounds:
1) the old one (shut down the host and confirm that the host has been rebooted)
2) shut down the vdsmd service, then re-enroll certificates after the host is marked as Non-Responsive

Relevant logs (for testing, all machines were moved to the future):
2028-05-20 11:58:39,903+03 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to <AFFECTED_HOST>
2028-05-20 11:58:39,904+03 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connected to <AFFECTED_HOST>:54321
2028-05-20 11:58:39,912+03 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
2028-05-20 11:58:39,913+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-80) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
2028-05-20 11:58:41,436+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] FINISH, ConnectStoragePoolVDSCommand, return: , log id: 23e59d22
2028-05-20 11:58:41,444+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] hostFromVds::selectedVds - 'host_mixed_2', spmStatus 'Free', storage pool 'golden_env_mixed', storage pool version '4.7'
2028-05-20 11:58:41,446+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] SPM Init: could not find reported vds or not up - pool: 'golden_env_mixed' vds_spm_id: '3'
2028-05-20 11:58:41,447+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] SPM selection - vds seems as spm 'host_mixed_3'
2028-05-20 11:58:41,449+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] START, SpmStopVDSCommand(HostName = host_mixed_3, SpmStopVDSCommandParameters:{hostId='97e56d7b-b0fc-40cb-a2dc-e8a3cde23252', storagePoolId='4334225b-4a73-48bc-ab20-15bf87cf9491'}), log id: 686ba67d
2028-05-20 11:58:41,449+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] SpmStopVDSCommand:: vds 'host_mixed_3' is in 'Connecting' status - not performing spm stop, pool id '4334225b-4a73-48bc-ab20-15bf87cf9491'
2028-05-20 11:58:41,449+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] FINISH, SpmStopVDSCommand, return: , log id: 686ba67d
2028-05-20 11:58:41,449+03 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] spm stop on spm failed, stopping spm selection!

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.0.7-0.9.el8ev.noarch

How reproducible:
always

Steps to Reproduce:
1. Have an environment (HE in my case) with 3 hosts
2. Disable time sync on all machines -- (these steps were performed for simulating
3. Move all machines 1 year into the future, with engine-setup (repeated 4 times, to simulate cert refresh of engine external certificates)
4. Host certificates should be less than a year from expiring
5. Enroll certificates on 2 hosts (I selected non-SPM hosts)
6. Move all machines 1 year into the future

Actual results:
- storage domains are in unknown state (while the SPM is affected)
- the non-refreshed host (SPM) is in state Connecting

Expected results:
- The host should be marked as NonResponsive when multiple handshake attempts fail? Or allow enrolling certificates in state "Connecting" as well.
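
To make the first expected result concrete, here is a minimal sketch of the requested behavior: a host that repeatedly fails the TLS handshake leaves Connecting and becomes NonResponsive, at which point certificate enrollment is allowed. This is not engine code. The class name HostMonitor, its methods, and the threshold MAX_HANDSHAKE_FAILURES are all hypothetical, and the rule that a successful handshake goes straight to Up is a simplification.

```python
from dataclasses import dataclass

# Hypothetical threshold; the engine's real policy and values are not
# stated in this report.
MAX_HANDSHAKE_FAILURES = 3


@dataclass
class HostMonitor:
    """Toy model of the host status transitions discussed in this report."""

    status: str = "Connecting"
    handshake_failures: int = 0

    def record_handshake(self, ok: bool) -> str:
        """Update the status after one connection attempt and return it."""
        if ok:
            # Simplification: a successful handshake resets the counter
            # and brings the host straight back to Up.
            self.handshake_failures = 0
            self.status = "Up"
        else:
            self.handshake_failures += 1
            if self.handshake_failures >= MAX_HANDSHAKE_FAILURES:
                # The behavior requested above: stop retrying forever
                # and mark the host NonResponsive.
                self.status = "NonResponsive"
            else:
                self.status = "Connecting"
        return self.status

    def can_enroll_certificates(self) -> bool:
        # Models only the state named in this report as allowing enrollment.
        return self.status == "NonResponsive"


if __name__ == "__main__":
    monitor = HostMonitor()
    for attempt in range(1, 4):
        print(attempt, monitor.record_handshake(ok=False))
    print("enroll allowed:", monitor.can_enroll_certificates())
```

With the assumed threshold of 3, the host stays in Connecting for the first two failed handshakes and moves to NonResponsive on the third. The second expected result (allowing enrollment directly in Connecting) would instead change only can_enroll_certificates.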
With an expired certificate and host services restarted, the host becomes Non-Responsive in about 3 minutes while it is in state Connecting. Installation > Enroll certificates is then possible. Management > Activate sets the host to Up state (there was a Non-Responsive reply for 15-20 seconds).

Software Version: 4.5.2-0.3.el8ev
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.