Bug 865385 - ovirt-engine-backend: Power Management: restart operation fails occasionally ( during start stage )
Summary: ovirt-engine-backend: Power Management: restart operation fails occasionally ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: ---
Assignee: Eli Mesika
QA Contact: Tareq Alayan
URL:
Whiteboard: infra
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-10-11 10:13 UTC by Tareq Alayan
Modified: 2016-02-10 19:32 UTC (History)
10 users (show)

Fixed In Version: si23
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-12-04 20:07:57 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
engine.log (4.50 MB, text/x-log)
2012-10-11 10:13 UTC, Tareq Alayan
no flags Details

Description Tareq Alayan 2012-10-11 10:13:32 UTC
Created attachment 625476 [details]
engine.log

Restarting host fails occasionally (~20%) of the times I run through jenkins. 

The current logic as follows: 
(1) action = Stop
(2) action=status (pulling 3 times) when we get status=OFF, we continue to (3).
(3) action = Start
(4) action = status (pulling 3 retries 60s between each retry) until we get status on

Please consider to send an additional FenceVdsVDSCommand with action Start, it succeeds most of the times.
The reason I think this is important is that with a turned-off host you cannot really do anything. 

Notes:
======
engine.log is attached.

snapshot of the log regarding the logic:
=========================================
2012-10-11 10:28:06,314 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Done, log id: 7039f34c
2012-10-11 10:28:06,314 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Waiting for vds puma39.scl.lab.tlv.redhat.com to start
2012-10-11 10:28:11,315 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Attempt 1 to get vds puma39.scl.lab.tlv.redhat.com status
2012-10-11 10:28:11,350 INFO  [org.ovirt.engine.core.bll.FencingExecutor] (ajp-/127.0.0.1:8702-1) Executing <Status> Power Management command, Proxy Host:puma40.scl.lab.tlv.redhat.com, Agent
:ipmilan, Target Host:puma39.scl.lab.tlv.redhat.com, Management IP:puma39-mgmt.scl.lab.tlv.redhat.com, User:admin, Options:
2012-10-11 10:28:11,352 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) START, FenceVdsVDSCommand(HostName = puma40.scl.lab.tlv.redhat.com, HostI
d = 3b8c8644-137b-11e2-a8f5-001a4a16976c, targetVdsId = 3ab0a962-137b-11e2-8123-001a4a16976c, action = Status, ip = puma39-mgmt.scl.lab.tlv.redhat.com, port = , type = ipmilan, user = admin,
 password = ******, options = ''), log id: 397dc723
2012-10-11 10:28:19,782 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: off, 
log id: 397dc723
2012-10-11 10:29:01,000 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-66) VM restvm_high 54630ded-6883-4938-aeef-7e7644455257 moved from PoweringUp -->
 Up
2012-10-11 10:29:01,009 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVdsDynamicDataVDSCommand] (QuartzScheduler_Worker-66) START, UpdateVdsDynamicDataVDSCommand(HostName = puma40.scl.lab.tlv
.redhat.com, HostId = 3b8c8644-137b-11e2-a8f5-001a4a16976c, vdsDynamic=org.ovirt.engine.core.common.businessentities.VdsDynamic@8f644e69), log id: 521b710c
2012-10-11 10:29:01,012 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVdsDynamicDataVDSCommand] (QuartzScheduler_Worker-66) FINISH, UpdateVdsDynamicDataVDSCommand, log id: 521b710c
2012-10-11 10:29:19,783 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Attempt 2 to get vds puma39.scl.lab.tlv.redhat.com status
2012-10-11 10:29:19,818 INFO  [org.ovirt.engine.core.bll.FencingExecutor] (ajp-/127.0.0.1:8702-1) Executing <Status> Power Management command, Proxy Host:puma40.scl.lab.tlv.redhat.com, Agent
:ipmilan, Target Host:puma39.scl.lab.tlv.redhat.com, Management IP:puma39-mgmt.scl.lab.tlv.redhat.com, User:admin, Options:
2012-10-11 10:29:19,820 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) START, FenceVdsVDSCommand(HostName = puma40.scl.lab.tlv.redhat.com, HostId = 3b8c8644-137b-11e2-a8f5-001a4a16976c, targetVdsId = 3ab0a962-137b-11e2-8123-001a4a16976c, action = Status, ip = puma39-mgmt.scl.lab.tlv.redhat.com, port = , type = ipmilan, user = admin, password = ******, options = ''), log id: 596a6a2c
2012-10-11 10:29:19,926 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: off, log id: 596a6a2c
2012-10-11 10:30:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-39) Autorecovering hosts is disabled, skipping
2012-10-11 10:30:00,002 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-39) Autorecovering storage domains is disabled, skipping
2012-10-11 10:30:19,926 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Attempt 3 to get vds puma39.scl.lab.tlv.redhat.com status
2012-10-11 10:30:19,974 INFO  [org.ovirt.engine.core.bll.FencingExecutor] (ajp-/127.0.0.1:8702-1) Executing <Status> Power Management command, Proxy Host:puma40.scl.lab.tlv.redhat.com, Agent:ipmilan, Target Host:puma39.scl.lab.tlv.redhat.com, Management IP:puma39-mgmt.scl.lab.tlv.redhat.com, User:admin, Options:
2012-10-11 10:30:19,976 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) START, FenceVdsVDSCommand(HostName = puma40.scl.lab.tlv.redhat.com, HostId = 3b8c8644-137b-11e2-a8f5-001a4a16976c, targetVdsId = 3ab0a962-137b-11e2-8123-001a4a16976c, action = Status, ip = puma39-mgmt.scl.lab.tlv.redhat.com, port = , type = ipmilan, user = admin, password = ******, options = ''), log id: d671cfd
2012-10-11 10:30:20,100 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: off, log id: d671cfd
2012-10-11 10:30:20,103 ERROR [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Failed to verify host puma39.scl.lab.tlv.redhat.com start status. Have retried 3 times with delay of 60 seconds between each retry.
2012-10-11 10:30:20,140 ERROR [org.ovirt.engine.core.bll.StartVdsCommand] (ajp-/127.0.0.1:8702-1) Transaction rolled-back for command: org.ovirt.engine.core.bll.StartVdsCommand.
2012-10-11 10:30:20,140 ERROR [org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand] (ajp-/127.0.0.1:8702-1) Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand.
2012-10-11 10:30:20,140 ERROR [org.ovirt.engine.core.bll.RestartVdsCommand] (ajp-/127.0.0.1:8702-1) Transaction rolled-back for command: org.ovirt.engine.core.bll.RestartVdsCommand.
2012-10-11 10:30:20,140 WARN  [org.ovirt.engine.core.bll.RestartVdsCommand] (ajp-/127.0.0.1:8702-1) Restart host action failed, updating host 3ab0a962-137b-11e2-8123-001a4a16976c to NonResponsive
2012-10-11 10:30:20,154 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (ajp-/127.0.0.1:8702-1) START, SetVdsStatusVDSCommand(HostName = puma39.scl.lab.tlv.redhat.com, HostId = 3ab0a962-137b-11e2-8123-001a4a16976c, status=NonResponsive, nonOperationalReason=NONE), log id: 1d685284
2012-10-11 10:30:20,168 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, SetVdsStatusVDSCommand, log id: 1d685284
2012-10-11 10:30:20,182 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-1) Operation Failed: []
2012-10-11 10:30:58,595 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-85) [7ccf1edf] XML RPC error in command HSMGetAllTasksStatusesVDS ( HostNam
e = puma39.scl.lab.tlv.redhat.com ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-10-11 10:30:58,596 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-85) [7ccf1edf] SpmStopVDSCommand::Stopping SPM on vds puma39.scl.lab.tlv.r
edhat.com, pool id a9110c84-329c-4fc5-93d8-f08fc9385caf
2012-10-11 10:31:56,670 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-39) VM rest_vm_rhel e2a31e2c-4405-44c5-aa18-53601d3313fa moved from Up --> Reboot
InProgress
2012-10-11 10:32:25,589 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-60) VM rest_vm_rhel e2a31e2c-4405-44c5-aa18-53601d3313fa moved from RebootInProgr
ess --> Up
2012-10-11 10:33:22,116 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-67) XML RPC error in command GetCapabilitiesVDS ( HostName = puma39.scl.lab
.tlv.redhat.com ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-10-11 10:33:58,596 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-85) [7ccf1edf] could not stop spm of pool a9110c84-329c-4fc5-93d8-f08fc938
5caf on vds 3ab0a962-137b-11e2-8123-001a4a16976c - reason: org.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcRunTimeException: java.util.concurrent.TimeoutException
2012-10-11 10:33:58,597 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-85) [7ccf1edf] FINISH, SpmStopVDSCommand, log id: 71b7bf13
2012-10-11 10:33:58,597 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-85) [7ccf1edf] spm stop on spm failed, stopping spm selection!
2012-10-11 10:34:08,629 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-40) hostFromVds::selectedVds - puma40.scl.lab.tlv.redhat.com, spmStatus Fre
e, storage pool DefaultRest


Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Oded Ramraz 2012-10-11 10:32:34 UTC
We encounter this issue quite often in RND integration environment as well .

Comment 2 Itamar Heim 2012-10-15 11:11:07 UTC
duplicate of bug 864157?

Comment 3 Eli Mesika 2012-10-17 09:38:01 UTC
http://gerrit.ovirt.org/#/c/8620/

Comment 7 Tareq Alayan 2012-11-05 08:11:18 UTC
couldn't reproduce.

Comment 8 Eli Mesika 2012-12-05 19:18:27 UTC
fixed in commit : 62b5a5b


Note You need to log in before you can comment on or make changes to this bug.