Bug 1149832 - Reboot 1 host on a multi-clustered setup causes oVirt to lose connection to un-rebooted hosts as well
Summary: Reboot 1 host on a multi-clustered setup causes oVirt to lose connection to ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Piotr Kliczewski
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On: 1147487
Blocks:
 
Reported: 2014-10-06 17:16 UTC by Ori Gofen
Modified: 2016-05-26 01:49 UTC (History)
12 users

Fixed In Version: vt13.4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-17 17:12:14 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
vdsm+engine logs (1.56 MB, application/x-bzip)
2014-10-06 17:16 UTC, Ori Gofen
vdsm + engine logs (1.09 MB, application/x-gzip)
2014-12-17 16:24 UTC, sefi litmanovich

Description Ori Gofen 2014-10-06 17:16:51 UTC
Created attachment 944319 [details]
vdsm+engine logs

Description of problem:

This bug relates to BZ #1148688; please read its description.

The following flow consists of 3 hosts, each belonging to a different cluster + DC, and all of the DCs are initialized with 1 file domain.

It seems that oVirt loses connectivity to all 3 hosts after rebooting only one of them, due to an inability to process the SSL message.

from engine log:

2014-10-06 19:30:20,402 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) [rt.jar:1.7.0_65]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) [rt.jar:1.7.0_65]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLEngineNioHelper.read(SSLEngineNioHelper.java:51) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.read(SSLClient.java:81) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.readBuffer(ReactorClient.java:210) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient.processIncoming(StompCommonClient.java:90) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:151) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:115) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:86) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:62) [vdsm-jsonrpc-java-client.jar:]


2014-10-06 19:39:30,907 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-45) Command SpmStatusVDSCommand(HostName = vdsc, HostId = 94d53be5-fa3b-4a46-9d93-3bde4cbeb2d5, storagePoolId = bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-10-06 19:39:30,913 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-45) [7794fda6] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4 Type: StoragePool

Even after the rebooted host regains its IP and comes up, oVirt cannot regain the connection; this can only be resolved by restarting the ovirt-engine service.
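
For context on the suspected failure mode: the stack trace above comes from the shared "SSL Stomp Reactor" thread in vdsm-jsonrpc-java, which appears to service the channels of multiple hosts (Reactor.processChannels in the trace). The sketch below is not the actual library code or the fix, only a minimal, hypothetical illustration (the class name PerChannelReactor and all other details are assumptions) of the isolation such a loop needs: catch the I/O failure per selection key and drop only the affected connection, so one host's closed channel cannot stall message processing for the others.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.Channel;
    import java.nio.channels.ClosedChannelException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    // Hypothetical, simplified reactor loop (not the vdsm-jsonrpc-java code).
    // One selector thread services the channels of every host; if an exception
    // thrown while reading one closed channel escapes the per-channel scope,
    // the remaining hosts stop being serviced, matching the symptom above.
    public class PerChannelReactor {

        private final Selector selector;

        public PerChannelReactor(Selector selector) {
            this.selector = selector;
        }

        public void processChannels() throws IOException {
            selector.select(1000);
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                try {
                    if (key.isValid() && key.isReadable()) {
                        read((SocketChannel) key.channel());
                    }
                } catch (IOException e) {
                    // Isolate the failure (including ClosedChannelException):
                    // cancel and close only the dead connection instead of
                    // letting the exception abort processing of all channels.
                    key.cancel();
                    closeQuietly(key.channel());
                }
            }
        }

        private void read(SocketChannel channel) throws IOException {
            ByteBuffer buffer = ByteBuffer.allocate(4096);
            if (channel.read(buffer) < 0) {
                // Remote peer closed the connection (e.g. the rebooted host).
                throw new ClosedChannelException();
            }
            // ... hand the buffer to the STOMP / JSON-RPC message parsing ...
        }

        private static void closeQuietly(Channel channel) {
            try {
                channel.close();
            } catch (IOException ignored) {
            }
        }
    }
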
Version-Release number of selected component (if applicable):
vt4

How reproducible:
100%

Steps to Reproduce:
1. Have 3 DCs, 3 clusters, 3 hosts, 3 file domains.
2. Reboot one of the hosts (in my flow, camel-vdsb was the rebooted host).
3. All hosts lose connectivity.

Actual results:
Rebooting 1 host on a setup which consists of 3 or more initialized DCs causes oVirt to lose connectivity to all hosts, thus making the whole system non-operational. This goes on until:
1. The rebooted host comes up and regains its IP.
2. After step 1, an ovirt-engine restart is executed manually.

Expected results:
oVirt should lose connectivity to the rebooted host only, and then reconstruct.
All other DCs should not change their status.

Additional info:

Comment 1 Oved Ourfali 2014-10-07 05:00:46 UTC
Piotr - Was it already addressed in your latest SSL fixes?
If it is, move it to MODIFIED so that it can be verified.

Comment 2 Piotr Kliczewski 2014-10-07 06:42:13 UTC
It was fixed with the latest SSL fix.

Comment 4 sefi litmanovich 2014-10-23 12:21:21 UTC
Verified with rhevm-3.5.0-0.17.beta.el6ev.noarch.

Reproduced according to the description and got the expected results (the other DC/cluster/host sets were not affected by the reboot; the rebooted host was back up after confirming the reboot and re-activating the host).

Comment 5 sefi litmanovich 2014-12-17 16:21:40 UTC
Unfortunately, I reproduced the bug on vt13.3.

My setup:
3 hosts with RHEL 7: 2 on the same cluster, another on a different DC/cluster.

Upon reboot of one host, after some time (before the host is back up), all the hosts move to the Connecting status. Choosing the 'Confirm host has been rebooted' option fails to change that. Only after an engine restart do the hosts go back up.

rhevm-3.5.0-0.25.el6ev.noarch
vdsm-4.16.8.1-3.el7ev.x86_64

Comment 6 sefi litmanovich 2014-12-17 16:24:20 UTC
Created attachment 970193 [details]
vdsm + engine logs

Comment 7 Piotr Kliczewski 2014-12-17 20:18:34 UTC
Can you please verify with the latest 3.5?

Comment 8 sefi litmanovich 2014-12-18 10:33:08 UTC
Verified with vdsm-jsonrpc-java-1.0.12-1.el6ev.noarch, which will be included in the upcoming build.

The same scenario caused only the rebooted host to go to the Connecting state while the other hosts stayed up.

Comment 9 Eyal Edri 2015-02-17 17:12:14 UTC
RHEV 3.5.0 was released. Closing.

