Bug 1149832

Summary: Rebooting 1 host on a multi-clustered setup causes oVirt to lose connection to un-rebooted hosts as well
Product: Red Hat Enterprise Virtualization Manager
Reporter: Ori Gofen <ogofen>
Component: ovirt-engine
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED CURRENTRELEASE
QA Contact: sefi litmanovich <slitmano>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.5.0
CC: acanan, ecohen, gklein, iheim, lpeer, lsurette, oourfali, rbalakri, Rhev-m-bugs, sherold, slitmano, yeylon
Target Milestone: ---
Keywords: Regression
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: vt13.4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-17 17:12:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1147487    
Bug Blocks:    
Attachments:
  vdsm+engine logs (flags: none)
  vdsm + engine logs (flags: none)

Description Ori Gofen 2014-10-06 17:16:51 UTC
Created attachment 944319 [details]
vdsm+engine logs

Description of problem:

This bug is related to BZ #1148688; please read its description.

The following flow involves 3 hosts, each belonging to a different cluster and DC,
and every DC is initialized with 1 file domain.

It seems that oVirt loses connectivity to all 3 hosts after rebooting only one of them, because it fails to process an SSL message.

From the engine log:

2014-10-06 19:30:20,402 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) [rt.jar:1.7.0_65]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) [rt.jar:1.7.0_65]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLEngineNioHelper.read(SSLEngineNioHelper.java:51) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.read(SSLClient.java:81) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.readBuffer(ReactorClient.java:210) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient.processIncoming(StompCommonClient.java:90) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:151) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:115) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:86) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:62) [vdsm-jsonrpc-java-client.jar:]


2014-10-06 19:39:30,907 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-45) Command SpmStatusVDSCommand(HostName = vdsc, HostId = 94d53be5-fa3b-4a46-9d93-3bde4cbeb2d5, storagePoolId = bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-10-06 19:39:30,913 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-45) [7794fda6] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4 Type: StoragePool
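
To illustrate why one dead connection takes down the others (a minimal sketch only, not the actual vdsm-jsonrpc-java code; the class name and buffer size are made up): a single reactor thread reads every host's channel in one pass, and the whole pass is wrapped in one try/catch, so the first closed channel aborts reads for all the healthy hosts as well.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.List;

public class SharedReactorSketch {
    private final ByteBuffer buffer = ByteBuffer.allocate(4096);

    // One reactor thread serving every host connection. The try/catch wraps
    // the whole pass, so a ClosedChannelException from the rebooted host's
    // channel aborts the loop and the remaining channels are never read.
    public void processChannels(List<SocketChannel> channels) {
        try {
            for (SocketChannel channel : channels) {
                buffer.clear();
                channel.read(buffer);   // throws for the closed connection
                // ... decode STOMP frame, dispatch JSON-RPC response ...
            }
        } catch (IOException e) {
            // logged once as "Unable to process messages"; every host now
            // looks unresponsive to the engine
        }
    }
}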

Even after the rebooted host regains its IP and comes back up, oVirt cannot re-establish the connection; this can only be resolved by restarting the ovirt-engine service.

Version-Release number of selected component (if applicable):
vt4

How reproducible:
100%

Steps to Reproduce:
1. Have 3 DCs, 3 clusters, 3 hosts, and 3 file domains.
2. Reboot one of the hosts (in my flow camel-vdsb was the rebooted host).
3. All hosts lose connectivity.

Actual results:
Rebooting 1 host on a setup that consists of 3 or more initialized DCs causes oVirt to lose connectivity to all hosts, making the whole system inoperable. This lasts until:
1. The rebooted host comes up and regains its IP.
2. After step 1, an ovirt-engine restart is executed manually.

Expected results:
oVirt should lose connectivity only to the rebooted host, and reconstruct.
All other DCs should not change status.

Additional info:

Comment 1 Oved Ourfali 2014-10-07 05:00:46 UTC
Piotr - Was it already addressed in your latest SSL fixes?
If so, move it to MODIFIED so that it can be verified.

Comment 2 Piotr Kliczewski 2014-10-07 06:42:13 UTC
It was fixed by the latest SSL fix.
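
Presumably (an assumption about the shape of the fix, not the actual patch) the failure handling was moved inside the per-channel loop, so one closed connection no longer aborts the pass for the other hosts. Using the same sketch as in the description:

    public void processChannels(List<SocketChannel> channels) {
        for (SocketChannel channel : channels) {
            try {
                buffer.clear();
                channel.read(buffer);
                // ... process the frame for this host ...
            } catch (IOException e) {
                // drop or reconnect this host's channel only;
                // keep serving the remaining hosts
            }
        }
    }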

Comment 4 sefi litmanovich 2014-10-23 12:21:21 UTC
Verified with rhevm-3.5.0-0.17.beta.el6ev.noarch.

Reproduced according to the description and got the expected results (the other DC/cluster/host combinations were not affected by the reboot; the rebooted host was back up after confirming the reboot and reactivating it).

Comment 5 sefi litmanovich 2014-12-17 16:21:40 UTC
Reproduced the bug on vt13.3 unfortunately.

My setup:
3 hosts with RHEL 7, 2 on the same cluster, another on a different DC/cluster.

Upon reboot of one host, after some time (before the host is back up), all the hosts move to the Connecting status. Choosing the 'Confirm host has been rebooted' option fails to change that. Only after restarting the engine do the hosts go back up.

rhevm-3.5.0-0.25.el6ev.noarch
vdsm-4.16.8.1-3.el7ev.x86_64

Comment 6 sefi litmanovich 2014-12-17 16:24:20 UTC
Created attachment 970193 [details]
vdsm + engine logs

Comment 7 Piotr Kliczewski 2014-12-17 20:18:34 UTC
Can you please verify with the latest 3.5 build?

Comment 8 sefi litmanovich 2014-12-18 10:33:08 UTC
Verified with vdsm-jsonrpc-java-1.0.12-1.el6ev.noarch, which will be included in the upcoming build.

The same scenario caused only the rebooted host to go to the Connecting state while the other hosts stayed Up.

Comment 9 Eyal Edri 2015-02-17 17:12:14 UTC
RHEV 3.5.0 was released. Closing.