Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2074091

Summary: Host hangs in Unavailable/Unassigned state after connection issues
Product: [oVirt] vdsm-jsonrpc-java Reporter: Jean-Louis Dupond <jean-louis>
Component: Core Assignee: bugs <bugs>
Status: CLOSED CURRENTRELEASE QA Contact: Pavol Brilla <pbrilla>
Severity: high Docs Contact:
Priority: high    
Version: 1.7.1 CC: bugs, mperina
Target Milestone: ovirt-4.5.2 Keywords: TestOnly
Target Release: 1.7.2 Flags: mperina: ovirt-4.5+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.5.2 vdsm-jsonrpc-java-1.7.2 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-09-08 16:09:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2090645    
Bug Blocks:    

Description Jean-Louis Dupond 2022-04-11 14:10:54 UTC
Description of problem:
When the connection to a host is lost for any reason (reboot, network issues, etc.), oVirt is sometimes unable to reconnect to the host.
Even when you reboot the host via SSH Manager in oVirt, the host reboots but the VDSM connection is never re-established.

How reproducible:
Sometimes

Steps to Reproduce:
1. Break connection to the host
2. Allow the connection again
3. oVirt tries to reconnect, but fails in some cases

Actual results:
The connection appears to hang in a locked state and never recovers.
The only way to recover is to restart ovirt-engine.

Expected results:
The connection should time out and then reconnect.


Additional info:

2022-04-11 08:57:14,286+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'xxxx', last response arrived 1526 ms ago.
-> Here is when I rebooted the host.


2022-04-11 08:57:16,709+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /xxxxx
-> It tried to reconnect, but it seems the connection never succeeded.


A Java stack trace shows the following:
"SSL Stomp Reactor" #156 daemon prio=5 os_prio=0 cpu=25159253.94ms elapsed=966518.59s allocated=1867G defined_classes=236 tid=0x00005606f4db6000 nid=0xf9b7d runnable  [0x00007fcbc5026000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.hashCode(java.base.14/Arrays.java:4569)
        at sun.security.util.ObjectIdentifier.hashCode(java.base.14/ObjectIdentifier.java:420)
        at java.util.HashMap.hash(java.base.14/HashMap.java:340)
        at java.util.HashMap.get(java.base.14/HashMap.java:553)
        at sun.security.x509.AlgorithmId.getName(java.base.14/AlgorithmId.java:259)
        at sun.security.rsa.RSAPrivateCrtKeyImpl.getAlgorithm(java.base.14/RSAPrivateCrtKeyImpl.java:185)
        at sun.security.ssl.SSLSessionImpl.isLocalAuthenticationValid(java.base.14/SSLSessionImpl.java:405)
        at sun.security.ssl.SSLSessionImpl.isRejoinable(java.base.14/SSLSessionImpl.java:387)
        at sun.security.ssl.SSLSessionImpl.isValid(java.base.14/SSLSessionImpl.java:392)
        - locked <0x000000071b9a4990> (a sun.security.ssl.SSLSessionImpl)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.getPeerCertificates(SSLClient.java:160)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLEngineNioHelper.process(SSLEngineNioHelper.java:132)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.pendingOperations(SSLClient.java:83)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:106)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.lambda$processChannels$1(Reactor.java:79)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor$$Lambda$1351/0x0000000841416040.accept(Unknown Source)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(java.base.14/ForEachOps.java:183)
        at java.util.stream.ReferencePipeline$2$1.accept(java.base.14/ReferencePipeline.java:177)
        at java.util.stream.ReferencePipeline$2$1.accept(java.base.14/ReferencePipeline.java:177)
        at java.util.Iterator.forEachRemaining(java.base.14/Iterator.java:133)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(java.base.14/Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(java.base.14/AbstractPipeline.java:484)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(java.base.14/AbstractPipeline.java:474)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(java.base.14/ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(java.base.14/ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(java.base.14/AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(java.base.14/ReferencePipeline.java:497)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:75)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:64)

   Locked ownable synchronizers:
        - None
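
A thread dump like the one above is normally taken with a JDK tool such as jstack against the engine process. As a rough illustration of what such a dump contains, the sketch below collects the state and stack frames of threads by name from inside a running JVM using the standard java.lang.management API. The class and method names (ReactorThreadDump, dumpThreadsNamed) are hypothetical and are not part of vdsm-jsonrpc-java or ovirt-engine.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of vdsm-jsonrpc-java.
final class ReactorThreadDump {

    // Returns one text block per live thread whose name contains nameFragment,
    // holding the thread name, its state and its stack frames.
    static List<String> dumpThreadsNamed(String nameFragment) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> out = new ArrayList<>();
        // false, false: do not request locked monitors or ownable synchronizers,
        // so the call works on every JVM.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null || !info.getThreadName().contains(nameFragment)) {
                continue;
            }
            StringBuilder sb = new StringBuilder();
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("        at ").append(frame).append('\n');
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Only sees threads of the JVM it runs in, so this prints nothing
        // unless a thread with that name exists in the same process.
        dumpThreadsNamed("SSL Stomp Reactor").forEach(System.out::println);
    }
}
```

Taking two such dumps a few seconds apart and finding the reactor thread RUNNABLE in the same frames both times is one way to tell a spinning loop from a thread that is merely busy.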


It seems to hang forever at https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/reactors/SSLClient.java#L160

Comment 1 RHEL Program Management 2022-04-11 14:58:44 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Artur Socha 2022-07-13 08:27:49 UTC
I believe this issue will be solved once https://bugzilla.redhat.com/show_bug.cgi?id=2090645 is done.
The reason for the failure is a bit different, but the proposed solution should handle both cases. What happens now is that the connection is being established even though the SSL handshake has not yet completed. However, an SSL handshake failure does not cause the host to be moved into the Non-Responsive state, which would then allow a proper reconnect. This is just a theory, which I will try to prove soon.
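
To make the theory above concrete, here is a minimal sketch of a connection that only counts as usable once the handshake has completed and that is marked failed when the handshake does not finish before a deadline. It is not the fix from BZ2090645. HandshakeTracker, its states and its methods are hypothetical names chosen for illustration, and the timeout value is an assumption.

```java
import java.time.Duration;

// Hypothetical sketch; these names do not come from vdsm-jsonrpc-java.
final class HandshakeTracker {
    enum State { HANDSHAKING, ESTABLISHED, FAILED }

    private final long deadlineNanos;
    private State state = State.HANDSHAKING;

    // startNanos is a System.nanoTime()-style timestamp taken when connecting starts.
    HandshakeTracker(long startNanos, Duration timeout) {
        this.deadlineNanos = startNanos + timeout.toNanos();
    }

    // Called once the SSL handshake has finished successfully.
    void handshakeCompleted() {
        if (state == State.HANDSHAKING) {
            state = State.ESTABLISHED;
        }
    }

    // Called when the handshake reports an error.
    void handshakeFailed() {
        if (state == State.HANDSHAKING) {
            state = State.FAILED;
        }
    }

    // Called on every reactor pass: a handshake still pending at or after
    // the deadline is turned into a failure instead of being retried forever.
    State poll(long nowNanos) {
        if (state == State.HANDSHAKING && nowNanos - deadlineNanos >= 0) {
            state = State.FAILED;
        }
        return state;
    }

    // The connection may carry requests only after a completed handshake.
    boolean isUsable() {
        return state == State.ESTABLISHED;
    }
}
```

In this sketch a caller that sees FAILED from poll would close the channel and report the host as unreachable, which is the point at which a reconnect or a move to Non-Responsive could be decided.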

Comment 3 Artur Socha 2022-07-28 13:56:36 UTC
I was unable to recreate this scenario on my dev env. However, after another round of investigation I am even more confident that this issue will be resolved by BZ2090645. In the provided stack trace the failure seems to be related to the SSL handshake. If I am right, such a host will either be reconnected (successfully) or will enter the Non-Responsive state after the retry limit is exceeded.
I suggest re-testing the issue on a QA env with ovirt-engine 4.5.2 and vdsm-jsonrpc-java 1.7.2, or simply closing it as a duplicate and re-opening it if the issue reappears.
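
The expected behaviour described in this comment (either a successful reconnect or Non-Responsive once the retry limit is exceeded) can be sketched as a small state holder. This is an illustration under assumptions, not the engine's implementation: ReconnectPolicy, HostStatus and the limit of three attempts used in the usage lines are hypothetical.

```java
// Hypothetical sketch of a bounded reconnect policy; not ovirt-engine code.
final class ReconnectPolicy {
    enum HostStatus { CONNECTING, UP, NON_RESPONSIVE }

    private final int maxAttempts;
    private int failedAttempts;
    private HostStatus status = HostStatus.CONNECTING;

    ReconnectPolicy(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
    }

    // Records the outcome of one connection attempt and returns the new status.
    // Results that arrive after leaving CONNECTING are ignored.
    HostStatus onAttemptResult(boolean connected) {
        if (status != HostStatus.CONNECTING) {
            return status;
        }
        if (connected) {
            status = HostStatus.UP;
        } else if (++failedAttempts >= maxAttempts) {
            // Retry limit exceeded: stop retrying so the host never stays
            // in CONNECTING indefinitely.
            status = HostStatus.NON_RESPONSIVE;
        }
        return status;
    }

    HostStatus status() {
        return status;
    }
}
```

For example, with `new ReconnectPolicy(3)`, two failed attempts leave the host in CONNECTING and the third failure moves it to NON_RESPONSIVE, while a success at any earlier point moves it to UP.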

Comment 5 Pavol Brilla 2022-09-07 06:33:09 UTC
Was not able to reproduce on QA envs.

In 4.5.2, a host that stays in the *Connecting* state for approximately 3 minutes falls back to *Non-Responsive*, and the user should then be able to use *Management > Activate*.

Adding resolution from BZ#2090645.

With an expired certificate and after restarting host services, the host becomes Non-Responsive after about 3 minutes in the Connecting state.

Installation > Enroll certificates is possible.

Management > Activate sets the host to the UP state (there was a 15-20 second Non-Responsive reply).

Comment 6 Casper (RHV QE bot) 2022-09-07 07:00:56 UTC
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align relevant severity, flags and keywords to raise PM_Score or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above verification threshold (1000).