Bug 2090645 - Host stuck in state 'Connecting' when certificates expire
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.5.0.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.5.2
Assignee: bugs@ovirt.org
QA Contact: Pavol Brilla
URL:
Whiteboard:
Depends On:
Blocks: 2074091
Reported: 2022-05-26 08:50 UTC by Petr Kubica
Modified: 2022-08-30 08:47 UTC

Fixed In Version: ovirt-engine-4.5.2
Clone Of:
Environment:
Last Closed: 2022-08-30 08:47:42 UTC
oVirt Team: Infra
Embargoed:
mperina: ovirt-4.5+
gdeolive: testing_ack+




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | oVirt ovirt-engine pull 506 | 0 | None | Draft | packaging: Bump vdsm-jsonrpc-java to 1.7.2 | 2022-07-06 07:27:04 UTC
Github | oVirt vdsm-jsonrpc-java pull 17 | 0 | None | open | SSL Cert expiration check on re-connect | 2022-07-13 08:58:17 UTC
Red Hat Issue Tracker | RHV-46124 | 0 | None | None | None | 2022-05-26 09:37:00 UTC

Description Petr Kubica 2022-05-26 08:50:31 UTC
Description of problem:
I tested bug #2079901, which should let users re-enroll certificates after they have expired. I think the main idea of that bug is to make it possible to bring a host with running VMs back to Up after its certificates have expired.

The problem is that the host remains in state Connecting and its status never changes to Non-Responsive (the state in which certificates can be re-enrolled).

My wild guess is that the affected host is still up with an open socket, so the engine is able to connect to it and only fails later, during the handshake.
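
To illustrate the guess above: a TCP connect and a TLS handshake are separate steps, so a host whose port is still open can accept the connection and only then fail certificate validation. The following Python sketch is not the engine's code (the engine uses vdsm-jsonrpc-java); the function name `probe_tls` and its return strings are made up for this illustration, and it assumes `context` trusts the CA that signed the host certificate.

```python
import socket
import ssl


def probe_tls(host, port, timeout=3.0, context=None):
    """Classify a connection attempt as TCP failure, TLS failure, or success.

    Hypothetical helper for illustration only.
    """
    context = context or ssl.create_default_context()
    try:
        # Step 1: plain TCP connect. Succeeds whenever something is listening.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "tcp_unreachable"
    try:
        # Step 2: TLS handshake. Certificate validity is checked here.
        with context.wrap_socket(sock, server_hostname=host):
            return "tls_ok"
    except ssl.SSLCertVerificationError as exc:
        # Raised when the peer certificate is rejected, e.g. because it expired.
        return "cert_rejected: " + exc.verify_message
    except (ssl.SSLError, OSError):
        # Any other handshake problem, including a handshake timeout.
        return "handshake_failed"
    finally:
        sock.close()
```

Under these assumptions, a host with an expired certificate would be expected to produce a `cert_rejected: ...` result rather than `tcp_unreachable`, which is consistent with a "Connected to" log line being followed by a PKIX validation error.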

When that host also serves as the SPM, the environment is stuck: storage domains are down and no new VM can be started, although running VMs remain up.

I guess there are two possible workarounds:
1) the old one: shut down the host and confirm that the host has been rebooted
2) stop the vdsmd service, then re-enroll certificates once the host is marked as Non-Responsive

Relevant logs (for testing, the clocks on all machines were moved into the future):
2028-05-20 11:58:39,903+03 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to <AFFECTED_HOST>
2028-05-20 11:58:39,904+03 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connected to <AFFECTED_HOST>:54321
2028-05-20 11:58:39,912+03 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
2028-05-20 11:58:39,913+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-80) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
2028-05-20 11:58:41,436+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] FINISH, ConnectStoragePoolVDSCommand, return: , log id: 23e59d22
2028-05-20 11:58:41,444+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] hostFromVds::selectedVds - 'host_mixed_2', spmStatus 'Free', storage pool 'golden_env_mixed', storage pool version '4.7'
2028-05-20 11:58:41,446+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] SPM Init: could not find reported vds or not up - pool: 'golden_env_mixed' vds_spm_id: '3'
2028-05-20 11:58:41,447+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] SPM selection - vds seems as spm 'host_mixed_3'
2028-05-20 11:58:41,449+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] START, SpmStopVDSCommand(HostName = host_mixed_3, SpmStopVDSCommandParameters:{hostId='97e56d7b-b0fc-40cb-a2dc-e8a3cde23252', storagePoolId='4334225b-4a73-48bc-ab20-15bf87cf9491'}), log id: 686ba67d
2028-05-20 11:58:41,449+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] SpmStopVDSCommand:: vds 'host_mixed_3' is in 'Connecting' status - not performing spm stop, pool id '4334225b-4a73-48bc-ab20-15bf87cf9491'
2028-05-20 11:58:41,449+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] FINISH, SpmStopVDSCommand, return: , log id: 686ba67d
2028-05-20 11:58:41,449+03 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [] spm stop on spm failed, stopping spm selection!
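
When scanning a longer engine.log for this failure, it can help to pull out only the ERROR entries that mention the certificate validation exception. The sketch below is a hypothetical helper, not part of oVirt; the regular expression is inferred only from the line layout shown above (timestamp, level, logger, thread, correlation ID, message) and may not cover other log formats.

```python
import re

# Layout inferred from the log excerpt above; an assumption, not a spec.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"\[(?P<correlation>[^\]]*)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def cert_validation_errors(lines):
    """Collect ERROR entries whose message mentions CertPathValidatorException."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if (
            entry
            and entry["level"] == "ERROR"
            and "CertPathValidatorException" in entry["message"]
        ):
            hits.append(entry)
    return hits
```

For example, `cert_validation_errors(open("engine.log"))` would return one dict per matching entry; the file name here is only a placeholder.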

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.0.7-0.9.el8ev.noarch

How reproducible:
always

Steps to Reproduce:
1. Have an environment (HE in my case) with 3 hosts
2. Disable time sync on all machines

-- (these steps were performed for simulating 
3. Move all machines 1 year into the future with engine-setup (repeat 4 times, to simulate the cert refresh of the engine's external certificates)
4. Host certificates should now have less than a year left before they expire
5. Enroll certificates on 2 hosts (I selected non-SPM hosts)
6. Move all machines 1 year into the future

Actual results:
- storage domains are in an unknown state (because the SPM is affected)
- the host whose certificates were not refreshed (the SPM) is in state Connecting

Expected results:
- the host should be marked as NonResponsive when multiple handshake attempts fail? Or enrolling certificates should also be allowed in state "Connecting"
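
The first option could be sketched roughly as follows. This is only an illustration of the requested behavior, not the fix that was implemented: the class and method names and the threshold of 3 failures are hypothetical, and the rule that enrollment is allowed only in NonResponsive is a simplification of what is described in this report.

```python
from enum import Enum


class HostStatus(Enum):
    UP = "Up"
    CONNECTING = "Connecting"
    NON_RESPONSIVE = "NonResponsive"


class HandshakeFailureTracker:
    """Moves a host out of Connecting after repeated handshake failures.

    Hypothetical sketch; max_failures=3 is an arbitrary assumption.
    """

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0
        self.status = HostStatus.CONNECTING

    def on_handshake_failure(self):
        # Count consecutive failures; give up on Connecting at the threshold.
        self.failures += 1
        if self.failures >= self.max_failures:
            self.status = HostStatus.NON_RESPONSIVE
        return self.status

    def on_handshake_success(self):
        # A successful handshake resets the counter and marks the host Up.
        self.failures = 0
        self.status = HostStatus.UP
        return self.status

    def can_enroll_certificates(self):
        # Simplification: enrollment is modeled as allowed only in NonResponsive.
        return self.status is HostStatus.NON_RESPONSIVE
```

With a tracker like this, a host with an expired certificate would leave Connecting after a bounded number of attempts instead of staying there indefinitely, at which point certificates could be enrolled.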

Comment 4 Pavol Brilla 2022-08-11 07:34:18 UTC
With an expired certificate and after restarting the host services, a host in state Connecting becomes Non-Responsive in about 3 minutes.

Installation > Enroll certificates is then possible.

Management > Activate sets the host to the Up state (the host was reported as Non-Responsive for 15-20 seconds).

Software Version: 4.5.2-0.3.el8ev

Comment 5 Sandro Bonazzola 2022-08-30 08:47:42 UTC
This bugzilla is included in the oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

