Description of problem:
When a host is upgraded without first being put into maintenance mode, the event log reports that the host is non-responsive. This is only a temporary state, and the host returns to the Up state after some time.

Version-Release number of selected component (if applicable):
ovirt-host-deploy-java-1.5.2-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Do not put the host into maintenance mode
2. Click Upgrade

Actual results:
The event log reports:
VDSM ..... command failed: Heartbeat exeeded
or
VDSM command failed: Connection reset by peer.

Expected results:
No error is reported.
Providing some information on the bug:

When 'Upgrade' is clicked while the host is up, the engine issues the 'Maintenance' command internally and monitors the host status until it becomes 'Maintenance'. Only then does the engine start the upgrade process, just as the user would have done. The only difference is at the end of the upgrade: we attempt to move the host back to the 'Up' state. If the host was already set by the user to 'Maintenance' mode, the upgrade process would set the host to 'Maintenance' mode once it is completed.

From the code, it seems that the upgrade process completes successfully, but bringing the host back up takes some time, and this is done after updating the host status to 'initializing' as the last step of the upgrade command. From that point, the host monitoring is responsible for the host life-cycle:

2016-10-07 14:01:30,914 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-3) [27f60906] Correlation ID: 27f60906, Call Stack: null, Custom Event ID: -1, Message: Host rhodain-rhel7_host01 upgrade was completed successfully.
2016-10-07 14:01:39,601 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Internal server error: null
2016-10-07 14:01:39,606 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [27f60906] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhodain-rhel7_host01 command failed: Heartbeat exeeded ...
2016-10-07 14:01:39,607 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [27f60906] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 14:01:39,607 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [27f60906] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded

There are no details about the version of the engine and the type of the host.

I suspect the chances of hitting this issue increase with the recent hosted-engine patch [1], where the host is accessed before it is confirmed to be reachable by the engine. That is not the case in this bug, though, since the log doesn't contain the HA command traces ('SetHaMaintenanceMode').

[1] https://gerrit.ovirt.org/#/c/63533/3/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/hostdeploy/UpgradeHostInternalCommand.java
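To make the flow described in this comment easier to follow, here is a minimal sketch of the status sequence. It is a hypothetical model written for illustration, not engine code: the names HostStatus and upgrade_flow are invented here, and only the three statuses named in the comment are modelled. The errors in the log above fall in the window between 'Initializing' and 'Up', when host monitoring has taken over.

```python
from enum import Enum


class HostStatus(Enum):
    # Only the statuses mentioned in the comment are modelled.
    UP = "Up"
    MAINTENANCE = "Maintenance"
    INITIALIZING = "Initializing"


def upgrade_flow(initial_status):
    """Return (step, status) pairs for an upgrade, in order.

    Hypothetical model of the described flow, not the engine's logic.
    """
    if initial_status not in (HostStatus.UP, HostStatus.MAINTENANCE):
        raise ValueError("sketch only models hosts that are Up or in Maintenance")

    started_up = initial_status is HostStatus.UP
    steps = []
    if started_up:
        # The engine issues 'Maintenance' internally and waits for it.
        steps.append(("engine moves host to maintenance", HostStatus.MAINTENANCE))
    steps.append(("upgrade runs", HostStatus.MAINTENANCE))
    if started_up:
        # Last step of the upgrade command.
        steps.append(("upgrade command finishes", HostStatus.INITIALIZING))
        # From here host monitoring owns the life-cycle.
        steps.append(("host monitoring brings host back", HostStatus.UP))
    else:
        # A host the user put in Maintenance is left there.
        steps.append(("upgrade command finishes", HostStatus.MAINTENANCE))
    return steps


if __name__ == "__main__":
    for step, status in upgrade_flow(HostStatus.UP):
        print(f"{step}: {status.value}")
```

Calling upgrade_flow(HostStatus.MAINTENANCE) instead shows the case where the user had already put the host into maintenance, which never passes through 'Initializing'.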
Roman, can you give version details as Moti requested?
Providing additional information for the bug:

Reproducing the flow on my env with debug mode for vdsm-json-rpc indicated a NullPointerException in the vdsm-json-rpc-java client, as a result of a closed channel.

<JsonRpcRequest id: "86502f54-0b9f-46b1-b05b-913cb60ff5ac", method: Host.getAllVmStats, params: {}>
2016-11-28 08:47:04,735 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Heartbeat exceeded. Closing channel
2016-11-28 08:47:04,787 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"zeus05.eng.lab.tlv.redhat.com:1976388753","message":"Heartbeat exceeded"},"id":null}
2016-11-28 08:47:04,787 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Internal server error: null: java.lang.NullPointerException
    at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.write(SSLClient.java:102) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.processOutgoing(ReactorClient.java:245) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:208) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:125) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:89) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:65) [vdsm-jsonrpc-java-client.jar:]
2016-11-28 08:47:04,791 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"zeus05.eng.lab.tlv.redhat.com:","message":"Internal server error"},"id":null}
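The sequence in the trace (the heartbeat closes the channel, then an outgoing write is still attempted) can be pictured with the sketch below. This is an illustration only and not the vdsm-jsonrpc-java code or its fix: the classes Client, RecordingTransport and ChannelClosedError are hypothetical, and it assumes the transport reference is cleared when the channel closes, which the trace does not confirm. The unguarded path fails with AttributeError, the Python analogue of the NullPointerException. The guarded path reports a closed channel instead.

```python
class ChannelClosedError(Exception):
    """Raised when a write is attempted on a closed channel."""


class RecordingTransport:
    """Hypothetical transport that just records what was sent."""

    def __init__(self):
        self.sent = []

    def send(self, message):
        self.sent.append(message)


class Client:
    """Hypothetical stand-in for a reactor client (illustration only)."""

    def __init__(self, transport):
        self._transport = transport
        self._outbox = []

    def queue(self, message):
        self._outbox.append(message)

    def close(self):
        # Assumption: closing (e.g. on heartbeat timeout) clears the reference.
        self._transport = None

    def process_outgoing_unguarded(self):
        # After close() this raises AttributeError on the first message.
        for message in self._outbox:
            self._transport.send(message)
        self._outbox.clear()

    def process_outgoing(self):
        # Check the channel state before touching the transport.
        if self._transport is None:
            pending = len(self._outbox)
            self._outbox.clear()
            raise ChannelClosedError(
                f"channel closed, dropped {pending} message(s)"
            )
        for message in self._outbox:
            self._transport.send(message)
        self._outbox.clear()


if __name__ == "__main__":
    client = Client(RecordingTransport())
    client.queue(b"Host.getAllVmStats")
    client.close()
    try:
        client.process_outgoing()
    except ChannelClosedError as exc:
        print(exc)
```

With the guard, the caller sees an explicit closed-channel condition and can decide whether to reconnect or retry, rather than getting an unrelated internal error.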
Back to POST as we need to merge 4.0 backports
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.
Verified in 4.1.0-0.3.beta2.el7