Description of problem:
When we invoke the postConnect method, we schedule a task that registers a channel with the selector, but in the meantime the channel may be closed due to connection issues. This causes an NPE, which can be mitigated by checking the actual state of the channel.

Version-Release number of selected component (if applicable):

How reproducible:
Occasionally.

Steps to Reproduce:
Hard to reproduce. Might happen on a slow network with a short heartbeat interval.

Additional info:
2015-12-21 18:21:41,230 INFO [org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher] (DefaultQuartzScheduler_Worker-3) [] Failed to fetch vms info for host 'buri05' - skipping VMs monitoring.
2015-12-21 18:21:41,229 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-58) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.lang.NullPointerException
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:157) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:120) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:634) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:119) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:226) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) [:1.7.0_91]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_91]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_91]
    at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Caused by: java.lang.NullPointerException
    at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient$2.call(SSLClient.java:137) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient$2.call(SSLClient.java:133) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable.call(Retryable.java:27) [vdsm-jsonrpc-java-client.jar:]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_91]
    at org.ovirt.vdsm.jsonrpc.client.utils.ReactorScheduler.performPendingOperations(ReactorScheduler.java:28) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:61) [vdsm-jsonrpc-java-client.jar:]
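To illustrate the race in the description, here is a minimal sketch of the "check the actual state of the channel" mitigation. It is not the vdsm-jsonrpc-java code: the class GuardedRegistration, the method registerIfOpen, and the AtomicReference holder are hypothetical names, and the sketch assumes that the close path drops the channel reference, which is one plausible way a deferred registration task could end up with an NPE.

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicReference;

public class GuardedRegistration {

    /**
     * Hypothetical helper: registers the channel for reads, or returns null
     * when the channel was dropped or closed before the task got to run.
     */
    public static SelectionKey registerIfOpen(SocketChannel channel, Selector selector)
            throws IOException {
        if (channel == null || !channel.isOpen()) {
            return null; // closed between scheduling and execution: skip registration
        }
        return channel.register(selector, SelectionKey.OP_READ);
    }

    public static void main(String[] args) throws Exception {
        try (Selector selector = Selector.open()) {
            SocketChannel opened = SocketChannel.open();
            opened.configureBlocking(false); // register() requires non-blocking mode

            // Assumption: the client keeps its channel in a field that close() clears.
            AtomicReference<SocketChannel> current = new AtomicReference<>(opened);

            // Deferred registration, as a scheduled post-connect task would do it.
            FutureTask<SelectionKey> task =
                    new FutureTask<>(() -> registerIfOpen(current.get(), selector));

            // The connection is torn down before the reactor runs the task.
            current.getAndSet(null).close();

            task.run();
            System.out.println("key after close: " + task.get());
        }
    }
}
```

With the guard in place the deferred task returns null instead of failing. The check only narrows the window: if another thread can close the channel between isOpen() and register(), the call can still throw ClosedChannelException, so the caller should be prepared to handle that as well.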
What flow do we need to cover for verification of this?
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Moving back to ON_QA (fixing my mistake from comment #1).
Bug tickets that are moved to testing must have the target release set so that the tester knows what to test. Please set the correct target release before moving to ON_QA.
To test this, check host connectivity over a slow network.
Anything special I should look for (specific VDSM/engine errors)?
There are no special errors. The issue is highly timing-dependent: if the connection is slow enough, an NPE may be thrown.
As this bug has no specific test and low reproducibility, I will run host operations repeatedly over a slow network. I'll report back once I either hit the mentioned NPE or have enough runs to verify the fix. For the slow connection I will use either the BRQ-TLV or the BRQ-BOSTON link.
Tested multiple scenarios (deploy, move to maintenance, activate, reinstall, PM actions) overnight. No NPE appeared in the logs. Moving to VERIFIED; if this issue reappears in the future, please reopen the bug. Tested on rhevm-3.6.3-0.1.el6.noarch.