Bug 1149832

Summary: Rebooting 1 host on a multi-clustered setup causes oVirt to lose connection to un-rebooted hosts as well
Product: Red Hat Enterprise Virtualization Manager
Reporter: Ori Gofen <ogofen>
Component: ovirt-engine
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED CURRENTRELEASE
QA Contact: sefi litmanovich <slitmano>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.5.0
CC: acanan, ecohen, gklein, iheim, lpeer, lsurette, oourfali, rbalakri, Rhev-m-bugs, sherold, slitmano, yeylon
Target Milestone: ---
Keywords: Regression
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: vt13.4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-17 17:12:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1147487    
Bug Blocks:    
Attachments:
  vdsm+engine logs (flags: none)
  vdsm + engine logs (flags: none)

Description Ori Gofen 2014-10-06 17:16:51 UTC
Created attachment 944319 [details]
vdsm+engine logs

Description of problem:

This bug is related to BZ #1148688; please read its description.

The following flow involves 3 hosts, each belonging to a different cluster and DC,
and every DC is initialized with 1 file domain.

It seems that oVirt loses connectivity to all 3 hosts after rebooting only one of them, because it fails to process an SSL message.

From the engine log:

2014-10-06 19:30:20,402 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) [rt.jar:1.7.0_65]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) [rt.jar:1.7.0_65]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLEngineNioHelper.read(SSLEngineNioHelper.java:51) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.read(SSLClient.java:81) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.readBuffer(ReactorClient.java:210) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient.processIncoming(StompCommonClient.java:90) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:151) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:115) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:86) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:62) [vdsm-jsonrpc-java-client.jar:]


2014-10-06 19:39:30,907 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-45) Command SpmStatusVDSCommand(HostName = vdsc, HostId = 94d53be5-fa3b-4a46-9d93-3bde4cbeb2d5, storagePoolId = bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-10-06 19:39:30,913 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-45) [7794fda6] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4 Type: StoragePool
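
To illustrate why one dead connection takes down the others (a minimal sketch only, not the actual vdsm-jsonrpc-java code; the class name and buffer size are made up): a single reactor thread reads every host's channel in one pass, and the whole pass is wrapped in one try/catch, so the first closed channel aborts reads for all the healthy hosts as well.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.List;

public class SharedReactorSketch {
    private final ByteBuffer buffer = ByteBuffer.allocate(4096);

    // One reactor thread serving every host connection. The try/catch wraps
    // the whole pass, so a ClosedChannelException from the rebooted host's
    // channel aborts the loop and the remaining channels are never read.
    public void processChannels(List<SocketChannel> channels) {
        try {
            for (SocketChannel channel : channels) {
                buffer.clear();
                channel.read(buffer);   // throws for the closed connection
                // ... decode STOMP frame, dispatch JSON-RPC response ...
            }
        } catch (IOException e) {
            // logged once as "Unable to process messages"; every host now
            // looks unresponsive to the engine
        }
    }
}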

Even after the rebooted host regains its IP and comes back up, oVirt cannot re-establish the connection; this can only be resolved by restarting the ovirt-engine service.

Version-Release number of selected component (if applicable):
vt4

How reproducible:
100%

Steps to Reproduce:
1. Have 3 DCs, 3 clusters, 3 hosts, and 3 file domains.
2. Reboot one of the hosts (in my flow camel-vdsb was the rebooted host).
3. All hosts lose connectivity.

Actual results:
Rebooting 1 host on a setup that consists of 3 or more initialized DCs causes oVirt to lose connectivity to all hosts, making the whole system inoperable. This lasts until:
1. The rebooted host comes up and regains its IP.
2. After step 1, an ovirt-engine restart is executed manually.

Expected results:
oVirt should lose connectivity only to the rebooted host, and reconstruct.
All other DCs should not change status.

Additional info:

Comment 1 Oved Ourfali 2014-10-07 05:00:46 UTC
Piotr - Was it already addressed in your latest SSL fixes?
If so, move it to MODIFIED so that it can be verified.

Comment 2 Piotr Kliczewski 2014-10-07 06:42:13 UTC
It was fixed by the latest SSL fix.
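
Presumably (an assumption about the shape of the fix, not the actual patch) the failure handling was moved inside the per-channel loop, so one closed connection no longer aborts the pass for the other hosts. Using the same sketch as in the description:

    public void processChannels(List<SocketChannel> channels) {
        for (SocketChannel channel : channels) {
            try {
                buffer.clear();
                channel.read(buffer);
                // ... process the frame for this host ...
            } catch (IOException e) {
                // drop or reconnect this host's channel only;
                // keep serving the remaining hosts
            }
        }
    }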

Comment 4 sefi litmanovich 2014-10-23 12:21:21 UTC
Verified with rhevm-3.5.0-0.17.beta.el6ev.noarch.

Reproduced according to the description and got the expected results (the other DC/cluster/host combinations were not affected by the reboot; the rebooted host was back up after confirming the reboot and reactivating it).

Comment 5 sefi litmanovich 2014-12-17 16:21:40 UTC
Reproduced the bug on vt13.3 unfortunately.

My setup:
3 hosts with RHEL 7, 2 on the same cluster, another on a different DC/cluster.

Upon reboot of one host, after some time (before the host is back up), all the hosts move to the Connecting status. Choosing the 'Confirm host has been rebooted' option fails to change that. Only after restarting the engine do the hosts go back up.

rhevm-3.5.0-0.25.el6ev.noarch
vdsm-4.16.8.1-3.el7ev.x86_64

Comment 6 sefi litmanovich 2014-12-17 16:24:20 UTC
Created attachment 970193 [details]
vdsm + engine logs

Comment 7 Piotr Kliczewski 2014-12-17 20:18:34 UTC
Can you please verify with the latest 3.5 build?

Comment 8 sefi litmanovich 2014-12-18 10:33:08 UTC
Verified with vdsm-jsonrpc-java-1.0.12-1.el6ev.noarch, which will be included in the upcoming build.

The same scenario caused only the rebooted host to go to the Connecting state while the other hosts stayed Up.

Comment 9 Eyal Edri 2015-02-17 17:12:14 UTC
RHEV 3.5.0 was released. Closing.