Created attachment 944319 [details]
vdsm + engine logs

Description of problem:
This bug is related to BZ #1148688; please read its description. The flow involves 3 hosts, each belonging to a different cluster + DC, and each DC is initialized with one file domain. oVirt loses connectivity to all 3 hosts after rebooting only one of them, apparently because it is unable to process an SSL message.

From the engine log:

2014-10-06 19:30:20,402 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) Unable to process messages: java.nio.channels.ClosedChannelException
	at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) [rt.jar:1.7.0_65]
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) [rt.jar:1.7.0_65]
	at org.ovirt.vdsm.jsonrpc.client.reactors.SSLEngineNioHelper.read(SSLEngineNioHelper.java:51) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.read(SSLClient.java:81) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.readBuffer(ReactorClient.java:210) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient.processIncoming(StompCommonClient.java:90) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:151) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:115) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:86) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:62) [vdsm-jsonrpc-java-client.jar:]

2014-10-06 19:39:30,907 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-45) Command SpmStatusVDSCommand(HostName = vdsc, HostId = 94d53be5-fa3b-4a46-9d93-3bde4cbeb2d5, storagePoolId = bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'

2014-10-06 19:39:30,913 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-45) [7794fda6] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: bfaaabfe-36b4-46bd-9803-b7bfe5fc00a4 Type: StoragePool

Even after the rebooted host regains its IP and comes back up, oVirt cannot re-establish the connection. The issue can only be resolved by restarting the ovirt-engine service.

Version-Release number of selected component (if applicable):
vt4

How reproducible:
100%

Steps to Reproduce:
1. Have 3 DCs, 3 clusters, 3 hosts, and 3 file domains.
2. Reboot one of the hosts (in my flow camel-vdsb was the rebooted host).
3. All hosts lose connectivity.

Actual results:
Rebooting 1 host on a setup consisting of 3 or more initialized DCs causes oVirt to lose connectivity to all hosts, rendering the whole system inoperable. This persists until:
1. The rebooted host comes back up and regains its IP.
2. After step 1, an ovirt-engine restart is executed manually.

Expected results:
oVirt should lose connectivity only to the rebooted host and then reconstruct. All other DCs should not change status.

Additional info:
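The stack trace above suggests that a ClosedChannelException raised while reading one client's channel escapes the shared "SSL Stomp Reactor" thread and stops message processing for every host. A minimal sketch of the defensive pattern (hypothetical class and method names, not the actual vdsm-jsonrpc-java code): catch per-channel I/O failures inside the reactor pass so that only the offending client is disconnected.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: process a batch of client channels so that an
// IOException (including ClosedChannelException, its subclass) from one
// channel disconnects only that client instead of aborting the whole pass.
public final class ReactorPass {

    /** Reads from each channel; returns the channels that failed and were closed. */
    public static List<SocketChannel> processAll(List<SocketChannel> clients) {
        List<SocketChannel> failed = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.allocate(4096);
        for (SocketChannel ch : clients) {
            buf.clear();
            try {
                if (ch.read(buf) < 0) {
                    throw new IOException("peer closed connection");
                }
                // ... hand buf off to the STOMP/JSON-RPC message parser here ...
            } catch (IOException e) {
                // Covers ClosedChannelException too: drop only this client.
                failed.add(ch);
                try { ch.close(); } catch (IOException ignored) { }
            }
        }
        return failed; // the reactor keeps serving the remaining hosts
    }
}
```

With this shape, a host reboot would surface as one entry in the failed list (eligible for reconnection) rather than an exception that kills the shared reactor loop for all three DCs.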
Piotr - was this already addressed in your latest SSL fixes? If so, please move it to MODIFIED so that it can be verified.
It was fixed with the latest SSL fix.
Verified with rhevm-3.5.0-0.17.beta.el6ev.noarch. Reproduced according to the description and got the expected results: the other DCs, clusters, and hosts were not affected by the reboot, and the rebooted host came back up after confirming the reboot and re-activating the host.
Unfortunately, reproduced the bug on vt13.3. My setup: 3 hosts with RHEL 7, two on the same cluster and a third on a different DC/cluster. Upon reboot of one host, after some time (before the host is back up), all the hosts move to status Connecting. Choosing the 'Confirm host has been rebooted' option fails to change that; only after an engine restart do the hosts go back up. rhevm-3.5.0-0.25.el6ev.noarch, vdsm-4.16.8.1-3.el7ev.x86_64
Created attachment 970193 [details]
vdsm + engine logs
Can you please verify with the latest 3.5 build?
Verified with vdsm-jsonrpc-java-1.0.12-1.el6ev.noarch, which will be included in the upcoming build. The same scenario caused only the rebooted host to go to the Connecting state, while the other hosts stayed up.
RHEV 3.5.0 was released. Closing.