Bug 1149832 - Reboot 1 host on a multi-clustered setup causes oVirt to lose connection to un-rebooted hosts as well
Summary: Reboot 1 host on a multi-clustered setup causes oVirt to lose connection to ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Piotr Kliczewski
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On: 1147487
Blocks:
 
Reported: 2014-10-06 17:16 UTC by Ori Gofen
Modified: 2016-05-26 01:49 UTC (History)
12 users

Fixed In Version: vt13.4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-17 17:12:14 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
vdsm+engine logs (1.56 MB, application/x-bzip)
2014-10-06 17:16 UTC, Ori Gofen
vdsm + engine logs (1.09 MB, application/x-gzip)
2014-12-17 16:24 UTC, sefi litmanovich

Description Ori Gofen 2014-10-06 17:16:51 UTC
Created attachment 944319 [details]
vdsm+engine logs

Description of problem:

This bug relates to BZ #1148688; please read its description.

The following flow consists of 3 hosts, each belonging to a different cluster + DC, and all of the DCs are initialized with 1 file domain.

It seems that oVirt loses connectivity to all 3 hosts after rebooting only one of them, due to an inability to process the SSL message.

from engine log:

2014-10-06 19:30:20,402 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) [rt.jar:1.7.0_65]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) [rt.jar:1.7.0_65]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLEngineNioHelper.read(SSLEngineNioHelper.java:51) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.read(SSLClient.java:81) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.readBuffer(ReactorClient.java:210) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient.processIncoming(StompCommonClient.java:90) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:151) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:115) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:86) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:62) [vdsm-jsonrpc-java-client.jar:]


2014-10-06 19:39:30,907 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-45) Command SpmStatusVDSCommand(HostName = vdsc, HostId = 94d53be5-fa3b-4a46-9d93-3bde4cbeb2d5, storagePoolId = bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-10-06 19:39:30,913 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-45) [7794fda6] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4 Type: StoragePool

Even after the rebooted host regains its IP and comes up, oVirt cannot regain the connection; this can only be resolved by restarting the ovirt-engine service.
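
For context on the suspected failure mode: the stack trace above comes from the shared "SSL Stomp Reactor" thread in vdsm-jsonrpc-java, which appears to service the channels of multiple hosts (Reactor.processChannels in the trace). The sketch below is not the actual library code or the fix, only a minimal, hypothetical illustration (the class name PerChannelReactor and all other details are assumptions) of the isolation such a loop needs: catch the I/O failure per selection key and drop only the affected connection, so one host's closed channel cannot stall message processing for the others.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.Channel;
    import java.nio.channels.ClosedChannelException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    // Hypothetical, simplified reactor loop (not the vdsm-jsonrpc-java code).
    // One selector thread services the channels of every host; if an exception
    // thrown while reading one closed channel escapes the per-channel scope,
    // the remaining hosts stop being serviced, matching the symptom above.
    public class PerChannelReactor {

        private final Selector selector;

        public PerChannelReactor(Selector selector) {
            this.selector = selector;
        }

        public void processChannels() throws IOException {
            selector.select(1000);
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                try {
                    if (key.isValid() && key.isReadable()) {
                        read((SocketChannel) key.channel());
                    }
                } catch (IOException e) {
                    // Isolate the failure (including ClosedChannelException):
                    // cancel and close only the dead connection instead of
                    // letting the exception abort processing of all channels.
                    key.cancel();
                    closeQuietly(key.channel());
                }
            }
        }

        private void read(SocketChannel channel) throws IOException {
            ByteBuffer buffer = ByteBuffer.allocate(4096);
            if (channel.read(buffer) < 0) {
                // Remote peer closed the connection (e.g. the rebooted host).
                throw new ClosedChannelException();
            }
            // ... hand the buffer to the STOMP / JSON-RPC message parsing ...
        }

        private static void closeQuietly(Channel channel) {
            try {
                channel.close();
            } catch (IOException ignored) {
            }
        }
    }
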
Version-Release number of selected component (if applicable):
vt4

How reproducible:
100%

Steps to Reproduce:
1. Have 3 DCs, 3 clusters, 3 hosts, 3 file domains.
2. Reboot one of the hosts (in my flow, camel-vdsb was the rebooted host).
3. All hosts lose connectivity.

Actual results:
Rebooting 1 host on a setup which consists of 3 or more initialized DCs causes oVirt to lose connectivity to all hosts, thus making the whole system non-operational. This goes on until:
1. The rebooted host comes up and regains its IP.
2. After step 1, an ovirt-engine restart is executed manually.

Expected results:
oVirt should lose connectivity to the rebooted host only, and then reconstruct.
All other DCs should not change their status.

Additional info:

Comment 1 Oved Ourfali 2014-10-07 05:00:46 UTC
Piotr - Was it already addressed in your latest SSL fixes?
If it is, move it to MODIFIED so that it can be verified.

Comment 2 Piotr Kliczewski 2014-10-07 06:42:13 UTC
It was fixed with the latest SSL fix.

Comment 4 sefi litmanovich 2014-10-23 12:21:21 UTC
Verified with rhevm-3.5.0-0.17.beta.el6ev.noarch.

Reproduced according to the description and got the expected results (the other DC/cluster/host sets were not affected by the reboot; the rebooted host was back up after confirming the reboot and re-activating the host).

Comment 5 sefi litmanovich 2014-12-17 16:21:40 UTC
Unfortunately, I reproduced the bug on vt13.3.

My setup:
3 hosts with RHEL 7: 2 on the same cluster, another on a different DC/cluster.

Upon reboot of one host, after some time (before the host is back up), all the hosts move to the Connecting status. Choosing the 'Confirm host has been rebooted' option fails to change that. Only after an engine restart do the hosts go back up.

rhevm-3.5.0-0.25.el6ev.noarch
vdsm-4.16.8.1-3.el7ev.x86_64

Comment 6 sefi litmanovich 2014-12-17 16:24:20 UTC
Created attachment 970193 [details]
vdsm + engine logs

Comment 7 Piotr Kliczewski 2014-12-17 20:18:34 UTC
Can you please verify with the latest 3.5?

Comment 8 sefi litmanovich 2014-12-18 10:33:08 UTC
Verified with vdsm-jsonrpc-java-1.0.12-1.el6ev.noarch, which will be included in the upcoming build.

The same scenario caused only the rebooted host to go to the Connecting state while the other hosts stayed up.

Comment 9 Eyal Edri 2015-02-17 17:12:14 UTC
RHEV 3.5.0 was released. Closing.

