Bug 1382691 - Host is marked as non responsive after upgrade
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-4.1.0-rc
Target Release: 4.1.0
Assignee: Moti Asayag
QA Contact: Petr Kubica
URL:
Whiteboard:
Depends On:
Blocks: 1383229
Reported: 2016-10-07 12:23 UTC by Roman Hodain
Modified: 2017-02-01 14:39 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1383229 (view as bug list)
Environment:
Last Closed: 2017-02-01 14:39:31 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: planning_ack+
masayag: devel_ack+
pstehlik: testing_ack+


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 67398 0 master MERGED stop processing if closing 2016-11-28 11:21:47 UTC
oVirt gerrit 67409 0 ovirt-engine-4.0 MERGED engine: Test host connectivity after upgrade 2016-11-29 10:08:41 UTC
oVirt gerrit 67432 0 master MERGED engine: Test host connectivity after upgrade 2016-11-28 15:10:38 UTC
oVirt gerrit 67554 0 master MERGED engine: ovirt-node should wait for restart after upgrade 2016-11-30 15:53:21 UTC
oVirt gerrit 67597 0 ovirt-engine-4.0 MERGED engine: ovirt-node should wait for restart after upgrade 2016-12-01 08:52:37 UTC

Description Roman Hodain 2016-10-07 12:23:41 UTC
Description of problem:
    When a host is upgraded without first being put into maintenance mode, the event log reports that the host is non-responsive.

    This is only a temporary state; the host returns to the Up state after some time.

Version-Release number of selected component (if applicable):

    ovirt-host-deploy-java-1.5.2-1.el7ev.noarch

How reproducible:

    100%

Steps to Reproduce:
1. Do not put host to maintenance mode
2. Click Upgrade

Actual results:

Event log reports:

    VDSM ..... command failed: Heartbeat exeeded

    or

    VDSM command failed: Connection reset by peer.


Expected results:

    No error is reported.

Comment 3 Moti Asayag 2016-10-09 07:08:25 UTC
Providing some information of the bug:

When 'Upgrade' is clicked while the host is up, the engine issues the 'Maintenance' command internally and monitors the host status until it becomes 'Maintenance'. Only then does the engine start the upgrade process, just as the user would have done. The only difference is at the end of the upgrade: the engine attempts to move the host back to the 'Up' state. If the user had already set the host to 'Maintenance' mode, the upgrade process leaves the host in 'Maintenance' mode once it is completed.

From the code, it appears that the upgrade itself completes successfully, but bringing the host back up takes some time. This happens after the upgrade command sets the host status to 'Initializing' as its last step. From that point on, host monitoring is responsible for the host life-cycle:

2016-10-07 14:01:30,914 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-3) [27f60906] Correlation ID: 27f60906, Call Stack: null, Custom Event ID: -1, Message: Host rhodain-rhel7_host01 upgrade was completed successfully.
2016-10-07 14:01:39,601 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Internal server error: null
2016-10-07 14:01:39,606 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [27f60906] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM rhodain-rhel7_host01 command failed: Heartbeat exeeded
...
2016-10-07 14:01:39,607 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [27f60906] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 14:01:39,607 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler10) [27f60906] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
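
The status flow described in this comment can be sketched roughly as follows. This is only an illustration of the described behavior, not ovirt-engine code: the class, enum, and method names are hypothetical, and the statuses are reduced to the three mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the status sequence described in this comment.
// None of these names come from ovirt-engine.
class UpgradeFlowSketch {
    enum HostStatus { UP, MAINTENANCE, INITIALIZING }

    // Returns the statuses the host is expected to pass through,
    // starting with the status it had when 'Upgrade' was clicked.
    static List<HostStatus> plannedStatuses(HostStatus initial) {
        List<HostStatus> steps = new ArrayList<>();
        steps.add(initial);
        if (initial == HostStatus.UP) {
            // The engine issues 'Maintenance' internally and waits for it.
            steps.add(HostStatus.MAINTENANCE);
            // Last step of the upgrade command; host monitoring takes over
            // from here and is expected to bring the host back to 'Up'.
            steps.add(HostStatus.INITIALIZING);
        }
        // A host already in 'Maintenance' is upgraded and left there.
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(plannedStatuses(HostStatus.UP));
        System.out.println(plannedStatuses(HostStatus.MAINTENANCE));
    }
}
```

The window between 'Initializing' and the host actually answering again is where the heartbeat errors in the log above are reported.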


The report gives no details about the engine version or the host type.
I suspect the chances of hitting this issue increase with the recent hosted-engine change [1], where the host is accessed before the engine has confirmed that it is reachable. That is not what happened in this bug, though, since the log does not contain traces of the HA command ('SetHaMaintenanceMode').

[1] https://gerrit.ovirt.org/#/c/63533/3/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/hostdeploy/UpgradeHostInternalCommand.java
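
The idea of confirming that the host is reachable before accessing it could look something like the sketch below. It is a minimal illustration under stated assumptions: the class name, the method name, the probe callback, and the retry parameters are all hypothetical and are not taken from ovirt-engine or from the linked patch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: poll a connectivity probe a bounded number of times
// before treating the host as reachable. Not ovirt-engine code.
class ConnectivityCheckSketch {
    // probe        - assumed callback that returns true once the host answers
    // maxAttempts  - upper bound on probe calls
    // delayMillis  - pause between failed attempts
    static boolean waitUntilReachable(BooleanSupplier probe, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated host that starts answering on the third probe.
        boolean reachable = waitUntilReachable(() -> ++calls[0] >= 3, 5, 10);
        System.out.println("reachable=" + reachable + " after " + calls[0] + " probes");
    }
}
```

Bounding the number of attempts keeps a host that never comes back from blocking the caller indefinitely; what the caller does on a false result is outside this sketch.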

Comment 4 Oved Ourfali 2016-10-10 05:04:13 UTC
Roman, can you give version details as Moti requested?

Comment 6 Moti Asayag 2016-11-28 07:12:03 UTC
Providing additional information for the bug:

Reproducing the flow in my environment with debug logging enabled for vdsm-json-rpc revealed a NullPointerException in the vdsm-json-rpc-java client, caused by a closed channel.

<JsonRpcRequest id: "86502f54-0b9f-46b1-b05b-913cb60ff5ac", method: Host.getAllVmStats, params: {}>
2016-11-28 08:47:04,735 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Heartbeat exceeded. Closing channel
2016-11-28 08:47:04,787 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"zeus05.eng.lab.tlv.redhat.com:1976388753","message":"Heartbeat exceeded"},"id":null}
2016-11-28 08:47:04,787 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Internal server error: null: java.lang.NullPointerException
	at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.write(SSLClient.java:102) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.processOutgoing(ReactorClient.java:245) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.process(ReactorClient.java:208) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.SSLClient.process(SSLClient.java:125) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.processChannels(Reactor.java:89) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:65) [vdsm-jsonrpc-java-client.jar:]

2016-11-28 08:47:04,791 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"zeus05.eng.lab.tlv.redhat.com:","message":"Internal server error"},"id":null}
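
The trace above shows the reactor still trying to write outgoing messages after the channel was closed on heartbeat timeout. One way to avoid that kind of failure is to skip processing once the client is closing, which is what the title of the linked patch "stop processing if closing" suggests. The sketch below is only an illustration of that guard: the class, its fields, and its methods are hypothetical and do not reproduce the vdsm-jsonrpc-java implementation.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of a "stop processing if closing" guard.
// Not the vdsm-jsonrpc-java code; all names are illustrative.
class ChannelClientSketch {
    // Stand-in for a network channel; set to null when closed.
    private StringBuilder channel = new StringBuilder();
    private final Queue<String> outbox = new ArrayDeque<>();
    private boolean closing;

    void send(String message) {
        outbox.add(message);
    }

    // Simulates the "Heartbeat exceeded. Closing channel" path.
    void onHeartbeatExceeded() {
        closing = true;
        channel = null;
    }

    // Writes queued messages and returns how many were written.
    // Without the guard, channel.append would throw a
    // NullPointerException after the channel has been closed.
    int process() {
        if (closing) {
            return 0;
        }
        int written = 0;
        String message;
        while ((message = outbox.poll()) != null) {
            channel.append(message);
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        ChannelClientSketch client = new ChannelClientSketch();
        client.send("Host.getAllVmStats");
        System.out.println("written before close: " + client.process());
        client.onHeartbeatExceeded();
        client.send("Host.getAllVmStats");
        System.out.println("written after close: " + client.process());
    }
}
```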

Comment 7 Martin Perina 2016-11-30 16:26:02 UTC
Back to POST as we need to merge 4.0 backports

Comment 8 Sandro Bonazzola 2017-01-25 07:54:49 UTC
4.0.6 was the last oVirt 4.0 release; please re-target this bug.

Comment 9 Petr Kubica 2017-01-31 11:04:17 UTC
Verified in 4.1.0-0.3.beta2.el7