Bug 2074091
| Summary: | Host hangs in Unavailable/Unassigned state after connection issues | ||
|---|---|---|---|
| Product: | [oVirt] vdsm-jsonrpc-java | Reporter: | Jean-Louis Dupond <jean-louis> |
| Component: | Core | Assignee: | bugs <bugs> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Pavol Brilla <pbrilla> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 1.7.1 | CC: | bugs, mperina |
| Target Milestone: | ovirt-4.5.2 | Keywords: | TestOnly |
| Target Release: | 1.7.2 | Flags: | mperina: ovirt-4.5+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | ovirt-engine-4.5.2 vdsm-jsonrpc-java-1.7.2 | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-09-08 16:09:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2090645 | ||
| Bug Blocks: | |||
The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

I believe this issue will be solved once https://bugzilla.redhat.com/show_bug.cgi?id=2090645 is done. The reason for the failure is a bit different, but the proposed solution should handle both cases. What happens now is that the connection is treated as established even though the SSL handshake is not yet completed, and an SSL handshake failure does not move the host into the Non-Responsive state, which would allow a proper reconnect. That is just a theory, which I will try to prove soon.

I was unable to recreate this scenario on my dev env. However, after another round of investigation I am even more confident that this issue will be resolved by BZ2090645. In the provided stack trace the failure seems to be related to the SSL handshake. If I am right, such a host will either be reconnected successfully or will enter the Non-Responsive state after the retry limit is exceeded. I suggest re-testing the issue on a QA env with ovirt-engine 4.5.2 and vdsm-jsonrpc-java 1.7.2, or simply closing it as a duplicate and re-opening it if the issue reappears.

Was not able to reproduce on QA envs. In 4.5.2, a host that has spent approximately 3 minutes retrying in the *Connecting* state falls to *Non-Responsive*, and the user should then be able to use *Management > Activate*.

Adding resolution from BZ#2090645. With an expired certificate and after restarting host services, the host becomes Non-Responsive after about 3 minutes in the Connecting state. Installation > Enroll certificates is possible. Management > Activate sets the host to the UP state (there was a Non-Responsive reply for 15-20 seconds).

This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align relevant severity, flags and keywords to raise PM_Score or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).
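
The reconnect behaviour described in the comments above (keep retrying while the host is Connecting, fall to Non-Responsive once the retry limit is exceeded, and let Management > Activate start the cycle again) can be sketched as a small state tracker. This is only an illustration under assumptions: `HostState`, `ReconnectTracker`, and the limit of 3 are hypothetical names and values, not code or settings taken from ovirt-engine or vdsm-jsonrpc-java.

```java
/** Hypothetical host states, named after the statuses mentioned in this bug. */
enum HostState { CONNECTING, UP, NON_RESPONSIVE }

/**
 * Illustrative sketch: counts failed connection attempts and gives up once
 * a retry limit is reached. A failed SSL handshake is counted like any
 * other failed attempt, which is the behaviour the comments ask for.
 */
final class ReconnectTracker {
    private final int retryLimit;
    private int failures;
    private HostState state = HostState.CONNECTING;

    ReconnectTracker(int retryLimit) {
        if (retryLimit < 1) {
            throw new IllegalArgumentException("retryLimit must be >= 1");
        }
        this.retryLimit = retryLimit;
    }

    /** The handshake completed, so the host is usable again. */
    HostState onHandshakeSucceeded() {
        failures = 0;
        state = HostState.UP;
        return state;
    }

    /** A connection attempt (TCP connect or SSL handshake) failed. */
    HostState onAttemptFailed() {
        failures++;
        state = failures >= retryLimit ? HostState.NON_RESPONSIVE : HostState.CONNECTING;
        return state;
    }

    /** Manual activation resets the counter and starts connecting again. */
    HostState activate() {
        failures = 0;
        state = HostState.CONNECTING;
        return state;
    }

    HostState state() {
        return state;
    }

    public static void main(String[] args) {
        ReconnectTracker tracker = new ReconnectTracker(3); // assumed limit
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + " failed -> " + tracker.onAttemptFailed());
        }
        System.out.println("after activate -> " + tracker.activate());
        System.out.println("after handshake -> " + tracker.onHandshakeSucceeded());
    }
}
```

The point of the sketch is that the tracker always reaches a terminal state: either the handshake succeeds or the failure count reaches the limit, so a host cannot stay in Connecting indefinitely.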
Description of problem:
When the connection to a host is lost for any reason (reboot, network issues, etc.), oVirt is sometimes unable to reconnect to the host. Even when you reboot the host via SSH Manager in oVirt, the host gets rebooted, but the VDSM connection is never re-established.

How reproducible:
Sometimes

Steps to Reproduce:
1. Break connection to the host
2. Allow the connection again
3. oVirt tries to reconnect, but fails in some cases

Actual results:
The connection seems to hang in some lock state and never recovers. The only way to recover is to restart ovirt-engine.

Expected results:
The connection should time out and connect again.

Additional info:
2022-04-11 08:57:14,286+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'xxxx', last response arrived 1526 ms ago.
-> Here is when I rebooted the host.
2022-04-11 08:57:16,709+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /xxxxx
-> It tried to reconnect, but it seems the connection never succeeded.

A Java stack trace shows the following:

"SSL Stomp Reactor" #156 daemon prio=5 os_prio=0 cpu=25159253.94ms elapsed=966518.59s allocated=1867G defined_classes=236 tid=0x00005606f4db6000 nid=0xf9b7d runnable [0x00007fcbc5026000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.hashCode(java.base.14/Arrays.java:4569)
        at sun.security.util.ObjectIdentifier.hashCode(java.base.14/ObjectIdentifier.java:420)
        at java.util.HashMap.hash(java.base.14/HashMap.java:340)
        at java.util.HashMap.get(java.base.14/HashMap.java:553)
        at sun.security.x509.AlgorithmId.getName(java.base.14/AlgorithmId.java:259)
        at sun.security.rsa.RSAPrivateCrtKeyImpl.getAlgorithm(java.base.14/RSAPrivateCrtKeyImpl.java:185)
        at sun.security.ssl.SSLSessionImpl.isLocalAuthenticationValid(java.base.14/SSLSessionImpl.java:405)
        at sun.security.ssl.SSLSessionImpl.isRejoinable(java.base.14/SSLSessionImpl.java:387)
        at sun.security.ssl.SSLSessionImpl.isValid(java.base.14/SSLSessionImpl.java:392)
        - locked <0x000000071b9a4990> (a sun.security.ssl.SSLSessionImpl)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.getPeerCertificates(SSLClient.java:160)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLEngineNioHelper.process(SSLEngineNioHelper.java:132)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.pendingOperations(SSLClient.java:83)
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:106)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.lambda$processChannels$1(Reactor.java:79)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor$$Lambda$1351/0x0000000841416040.accept(Unknown Source)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(java.base.14/ForEachOps.java:183)
        at java.util.stream.ReferencePipeline$2$1.accept(java.base.14/ReferencePipeline.java:177)
        at java.util.stream.ReferencePipeline$2$1.accept(java.base.14/ReferencePipeline.java:177)
        at java.util.Iterator.forEachRemaining(java.base.14/Iterator.java:133)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(java.base.14/Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(java.base.14/AbstractPipeline.java:484)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(java.base.14/AbstractPipeline.java:474)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(java.base.14/ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(java.base.14/ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(java.base.14/AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(java.base.14/ReferencePipeline.java:497)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:75)
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:64)

   Locked ownable synchronizers:
        - None

It seems like it's hanging forever on https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/reactors/SSLClient.java#L160
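
The expected result stated in this report, that a stuck connection should time out and reconnect, can be illustrated with a deadline check that a reactor-style loop evaluates on every pass instead of waiting on the handshake indefinitely. This is a minimal sketch and not the change shipped in vdsm-jsonrpc-java 1.7.2: `HandshakeWatchdog`, its method names, and the 30-second timeout are all assumptions made for the example.

```java
import java.time.Duration;

/**
 * Illustrative sketch: remembers when a handshake started and reports
 * whether it has been pending for longer than the allowed timeout.
 * The check never blocks, so it can run on each pass of an event loop.
 */
final class HandshakeWatchdog {
    private final long deadlineNanos;

    /** startNanos is a System.nanoTime() style reading taken when the handshake begins. */
    HandshakeWatchdog(long startNanos, Duration timeout) {
        this.deadlineNanos = startNanos + timeout.toNanos();
    }

    /**
     * Returns true only when the handshake is still pending and the
     * deadline has passed. A completed handshake never expires.
     */
    boolean isExpired(boolean handshakeDone, long nowNanos) {
        // Subtraction keeps the comparison correct if nanoTime wraps around.
        return !handshakeDone && nowNanos - deadlineNanos >= 0;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        HandshakeWatchdog watchdog = new HandshakeWatchdog(start, Duration.ofSeconds(30));

        // Simulate a loop pass 31 seconds later with the handshake still pending.
        long later = start + Duration.ofSeconds(31).toNanos();
        boolean handshakeDone = false;
        if (watchdog.isExpired(handshakeDone, later)) {
            System.out.println("handshake timed out: close the channel and schedule a reconnect");
        }
    }
}
```

Passing the current time in as a parameter keeps the check deterministic, so the timeout path can be exercised without a real hung connection.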