Bug 1382691

Summary: Host is marked as non responsive after upgrade
Product: [oVirt] ovirt-engine
Reporter: Roman Hodain <rhodain>
Component: BLL.Infra
Assignee: Moti Asayag <masayag>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Kubica <pkubica>
Severity: medium
Docs Contact:
Priority: high
Version: ---
CC: bugs, masayag, mgoldboi, mperina, oourfali, rhodain
Target Milestone: ovirt-4.1.0-rc
Keywords: ZStream
Target Release: 4.1.0
Flags: rule-engine: ovirt-4.1+
       rule-engine: planning_ack+
       masayag: devel_ack+
       pstehlik: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1383229
Environment:
Last Closed: 2017-02-01 14:39:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1383229

Description Roman Hodain 2016-10-07 12:23:41 UTC
Description of problem:
    When a host is upgraded without first being put into maintenance mode, the event log reports that the host is non-responsive.

    This is only a temporary state; the host returns to the Up state after some time.

Version-Release number of selected component (if applicable):

    ovirt-host-deploy-java-1.5.2-1.el7ev.noarch

How reproducible:

    100%

Steps to Reproduce:
1. Do not put the host into maintenance mode
2. Click Upgrade

Actual results:

Event log reports:

    VDSM ..... command failed: Heartbeat exeeded

    or

    VDSM command failed: Connection reset by peer.


Expected results:

    No error is reported.

Comment 3 Moti Asayag 2016-10-09 07:08:25 UTC
Providing some information about the bug:

When 'Upgrade' is clicked while the host is up, the engine issues the 'Maintenance' command internally and monitors the host status until it becomes 'Maintenance'. Only then does the engine start the upgrade process, the same as the user would have done manually. The only difference is at the end of the upgrade: we attempt to move the host back to the 'Up' state. If the host had already been set to 'Maintenance' mode by the user, the upgrade process sets the host to 'Maintenance' mode once it is completed.
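
The two paths described above can be sketched as a small model. This is only an illustration of the described behaviour, not ovirt-engine code; `HostStatus` and `upgrade_flow` are hypothetical names introduced here.

```python
from enum import Enum


class HostStatus(Enum):
    """Only the two starting states discussed in this comment (hypothetical model)."""
    UP = "Up"
    MAINTENANCE = "Maintenance"


def upgrade_flow(initial):
    """Return the ordered steps of the upgrade flow for a given starting status.

    Hypothetical helper for illustration; it does not call any engine API.
    """
    steps = []
    if initial is HostStatus.UP:
        # The engine issues 'Maintenance' internally and waits for the status change.
        steps.append("issue Maintenance command")
        steps.append("wait until status is Maintenance")
    steps.append("run upgrade")
    if initial is HostStatus.UP:
        # Only a host that was up before the upgrade is moved back.
        steps.append("attempt to move host back to Up")
    else:
        # A host the user had already put into maintenance stays there.
        steps.append("leave host in Maintenance")
    return steps
```

Calling `upgrade_flow(HostStatus.UP)` lists the extra internal maintenance steps and the final attempt to bring the host back up, which is the path this bug is about.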

Looking at the code, it seems that the upgrade process completes successfully, but bringing the host back up takes some time, and this happens after the host status is updated to 'initializing' as the last step of the upgrade command. From that point on, host monitoring is responsible for the host life cycle:

2016-10-07 14:01:30,914 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-3) [27f60906] Correlation ID: 27f60906, Call Stack: null, Custom Event ID: -1, Message: Host rhodain-rhel7_host01 upgrade was completed successfully.
2016-10-07 14:01:39,601 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Internal server error: null
2016-10-07 14:01:39,606 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [27f60906] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhodain-rhel7_host01 command failed: Heartbeat exeeded
...
2016-10-07 14:01:39,607 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [27f60906] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 14:01:39,607 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [27f60906] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
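
For scale, the gap between the "upgrade was completed successfully" event and the first "Heartbeat exeeded" audit message in the excerpt above can be computed from the two timestamps. The snippet is a minimal sketch; the constant and function names are made up for this illustration, and only the timestamp strings come from the log.

```python
from datetime import datetime

# Timestamps copied from the two AuditLogDirector lines quoted above.
UPGRADE_DONE = "2016-10-07 14:01:30,914"
HEARTBEAT_ERROR = "2016-10-07 14:01:39,606"

# Engine log timestamps use a comma before the milliseconds.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start, end):
    """Return the elapsed seconds between two engine-log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


gap = seconds_between(UPGRADE_DONE, HEARTBEAT_ERROR)
print(f"{gap:.3f} s between upgrade completion and the heartbeat error")
```

The result is just under nine seconds, so the error is raised shortly after the upgrade command hands the host over to monitoring.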


The report has no details about the engine version or the host type.
I suspect the chances of hitting this issue increase with the recent hosted-engine patch [1], where the host is accessed before the engine has confirmed that it is reachable. That is not the case in this bug, though, since the log doesn't contain the HA command traces ('SetHaMaintenanceMode').

[1] https://gerrit.ovirt.org/#/c/63533/3/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/hostdeploy/UpgradeHostInternalCommand.java

Comment 4 Oved Ourfali 2016-10-10 05:04:13 UTC
Roman, can you give version details as Moti requested?

Comment 6 Moti Asayag 2016-11-28 07:12:03 UTC
Providing additional information for the bug:

Reproducing the flow on my environment with debug mode enabled for vdsm-json-rpc revealed a NullPointerException in the vdsm-json-rpc-java client, caused by a closed channel.

<JsonRpcRequest id: "86502f54-0b9f-46b1-b05b-913cb60ff5ac", method: Host.getAllVmStats, params: {}>
2016-11-28 08:47:04,735 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Heartbeat exceeded. Closing channel
2016-11-28 08:47:04,787 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"zeus05.eng.lab.tlv.redhat.com:1976388753","message":"Heartbeat exceeded"},"id":null}
2016-11-28 08:47:04,787 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Internal server error: null: java.lang.NullPointerException
	at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.write(SSLClient.java:102) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.processOutgoing(ReactorClient.java:245) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:208) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:125) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:89) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:65) [vdsm-jsonrpc-java-client.jar:]

2016-11-28 08:47:04,791 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"zeus05.eng.lab.tlv.redhat.com:","message":"Internal server error"},"id":null}
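
The failure mode in the trace above (a write attempted after the channel was closed on "Heartbeat exceeded") can be illustrated with a small Python analogue. This is a sketch under stated assumptions, not the vdsm-jsonrpc-java code: `SketchClient` and `ChannelClosedError` are hypothetical, and it assumes the exception comes from using a channel reference that was cleared on close.

```python
class ChannelClosedError(ConnectionError):
    """Raised when a write is attempted on a closed channel (hypothetical)."""


class SketchClient:
    """Hypothetical stand-in for a reactor client; not the real client API."""

    def __init__(self):
        # A list stands in for an open socket channel.
        self.channel = []

    def close(self):
        # Assumption: closing drops the channel reference, as after
        # "Heartbeat exceeded. Closing channel" in the log.
        self.channel = None

    def write_unguarded(self, message):
        # After close() this raises AttributeError, Python's analogue of the NPE.
        self.channel.append(message)

    def write_guarded(self, message):
        # Checking first turns the opaque failure into a clear connection error.
        if self.channel is None:
            raise ChannelClosedError("channel closed, message not sent")
        self.channel.append(message)
```

With a guard of this kind the caller sees a connection-level error rather than an "Internal server error: null", which is easier to map to a host state.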

Comment 7 Martin Perina 2016-11-30 16:26:02 UTC
Back to POST as we need to merge 4.0 backports

Comment 8 Sandro Bonazzola 2017-01-25 07:54:49 UTC
4.0.6 was the last oVirt 4.0 release; please re-target this bug.

Comment 9 Petr Kubica 2017-01-31 11:04:17 UTC
Verified in 4.1.0-0.3.beta2.el7