Bug 865385 - ovirt-engine-backend: Power Management: restart operation fails occasionally ( during start stage )
ovirt-engine-backend: Power Management: restart operation fails occasionally ...
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
3.1.0
Unspecified Unspecified
unspecified Severity high
: ---
: ---
Assigned To: Eli Mesika
Tareq Alayan
infra
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-10-11 06:13 EDT by Tareq Alayan
Modified: 2016-02-10 14:32 EST (History)
10 users (show)

See Also:
Fixed In Version: si23
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-12-04 15:07:57 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
engine.log (4.50 MB, text/x-log)
2012-10-11 06:13 EDT, Tareq Alayan
no flags Details

  None (edit)
Description Tareq Alayan 2012-10-11 06:13:32 EDT
Created attachment 625476 [details]
engine.log

Restarting host fails occasionally (~20%) of the times I run through jenkins. 

The current logic as follows: 
(1) action = Stop
(2) action=status (pulling 3 times) when we get status=OFF, we continue to (3).
(3) action = Start
(4) action = status (pulling 3 retries 60s between each retry) until we get status on

Please consider to send an additional FenceVdsVDSCommand with action Start, it succeeds most of the times.
The reason I think this is important is that with a turned-off host you cannot really do anything. 

Notes:
======
engine.log is attached.

snapshot of the log regarding the logic:
=========================================
2012-10-11 10:28:06,314 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Done, log id: 7039f34c
2012-10-11 10:28:06,314 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Waiting for vds puma39.scl.lab.tlv.redhat.com to start
2012-10-11 10:28:11,315 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Attempt 1 to get vds puma39.scl.lab.tlv.redhat.com status
2012-10-11 10:28:11,350 INFO  [org.ovirt.engine.core.bll.FencingExecutor] (ajp-/127.0.0.1:8702-1) Executing <Status> Power Management command, Proxy Host:puma40.scl.lab.tlv.redhat.com, Agent
:ipmilan, Target Host:puma39.scl.lab.tlv.redhat.com, Management IP:puma39-mgmt.scl.lab.tlv.redhat.com, User:admin, Options:
2012-10-11 10:28:11,352 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) START, FenceVdsVDSCommand(HostName = puma40.scl.lab.tlv.redhat.com, HostI
d = 3b8c8644-137b-11e2-a8f5-001a4a16976c, targetVdsId = 3ab0a962-137b-11e2-8123-001a4a16976c, action = Status, ip = puma39-mgmt.scl.lab.tlv.redhat.com, port = , type = ipmilan, user = admin,
 password = ******, options = ''), log id: 397dc723
2012-10-11 10:28:19,782 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: off, 
log id: 397dc723
2012-10-11 10:29:01,000 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-66) VM restvm_high 54630ded-6883-4938-aeef-7e7644455257 moved from PoweringUp -->
 Up
2012-10-11 10:29:01,009 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVdsDynamicDataVDSCommand] (QuartzScheduler_Worker-66) START, UpdateVdsDynamicDataVDSCommand(HostName = puma40.scl.lab.tlv
.redhat.com, HostId = 3b8c8644-137b-11e2-a8f5-001a4a16976c, vdsDynamic=org.ovirt.engine.core.common.businessentities.VdsDynamic@8f644e69), log id: 521b710c
2012-10-11 10:29:01,012 INFO  [org.ovirt.engine.core.vdsbroker.UpdateVdsDynamicDataVDSCommand] (QuartzScheduler_Worker-66) FINISH, UpdateVdsDynamicDataVDSCommand, log id: 521b710c
2012-10-11 10:29:19,783 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Attempt 2 to get vds puma39.scl.lab.tlv.redhat.com status
2012-10-11 10:29:19,818 INFO  [org.ovirt.engine.core.bll.FencingExecutor] (ajp-/127.0.0.1:8702-1) Executing <Status> Power Management command, Proxy Host:puma40.scl.lab.tlv.redhat.com, Agent
:ipmilan, Target Host:puma39.scl.lab.tlv.redhat.com, Management IP:puma39-mgmt.scl.lab.tlv.redhat.com, User:admin, Options:
2012-10-11 10:29:19,820 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) START, FenceVdsVDSCommand(HostName = puma40.scl.lab.tlv.redhat.com, HostId = 3b8c8644-137b-11e2-a8f5-001a4a16976c, targetVdsId = 3ab0a962-137b-11e2-8123-001a4a16976c, action = Status, ip = puma39-mgmt.scl.lab.tlv.redhat.com, port = , type = ipmilan, user = admin, password = ******, options = ''), log id: 596a6a2c
2012-10-11 10:29:19,926 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: off, log id: 596a6a2c
2012-10-11 10:30:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-39) Autorecovering hosts is disabled, skipping
2012-10-11 10:30:00,002 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-39) Autorecovering storage domains is disabled, skipping
2012-10-11 10:30:19,926 INFO  [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Attempt 3 to get vds puma39.scl.lab.tlv.redhat.com status
2012-10-11 10:30:19,974 INFO  [org.ovirt.engine.core.bll.FencingExecutor] (ajp-/127.0.0.1:8702-1) Executing <Status> Power Management command, Proxy Host:puma40.scl.lab.tlv.redhat.com, Agent:ipmilan, Target Host:puma39.scl.lab.tlv.redhat.com, Management IP:puma39-mgmt.scl.lab.tlv.redhat.com, User:admin, Options:
2012-10-11 10:30:19,976 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) START, FenceVdsVDSCommand(HostName = puma40.scl.lab.tlv.redhat.com, HostId = 3b8c8644-137b-11e2-a8f5-001a4a16976c, targetVdsId = 3ab0a962-137b-11e2-8123-001a4a16976c, action = Status, ip = puma39-mgmt.scl.lab.tlv.redhat.com, port = , type = ipmilan, user = admin, password = ******, options = ''), log id: d671cfd
2012-10-11 10:30:20,100 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: off, log id: d671cfd
2012-10-11 10:30:20,103 ERROR [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Failed to verify host puma39.scl.lab.tlv.redhat.com start status. Have retried 3 times with delay of 60 seconds between each retry.
2012-10-11 10:30:20,140 ERROR [org.ovirt.engine.core.bll.StartVdsCommand] (ajp-/127.0.0.1:8702-1) Transaction rolled-back for command: org.ovirt.engine.core.bll.StartVdsCommand.
2012-10-11 10:30:20,140 ERROR [org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand] (ajp-/127.0.0.1:8702-1) Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand.
2012-10-11 10:30:20,140 ERROR [org.ovirt.engine.core.bll.RestartVdsCommand] (ajp-/127.0.0.1:8702-1) Transaction rolled-back for command: org.ovirt.engine.core.bll.RestartVdsCommand.
2012-10-11 10:30:20,140 WARN  [org.ovirt.engine.core.bll.RestartVdsCommand] (ajp-/127.0.0.1:8702-1) Restart host action failed, updating host 3ab0a962-137b-11e2-8123-001a4a16976c to NonResponsive
2012-10-11 10:30:20,154 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (ajp-/127.0.0.1:8702-1) START, SetVdsStatusVDSCommand(HostName = puma39.scl.lab.tlv.redhat.com, HostId = 3ab0a962-137b-11e2-8123-001a4a16976c, status=NonResponsive, nonOperationalReason=NONE), log id: 1d685284
2012-10-11 10:30:20,168 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, SetVdsStatusVDSCommand, log id: 1d685284
2012-10-11 10:30:20,182 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-1) Operation Failed: []
2012-10-11 10:30:58,595 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-85) [7ccf1edf] XML RPC error in command HSMGetAllTasksStatusesVDS ( HostNam
e = puma39.scl.lab.tlv.redhat.com ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-10-11 10:30:58,596 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-85) [7ccf1edf] SpmStopVDSCommand::Stopping SPM on vds puma39.scl.lab.tlv.r
edhat.com, pool id a9110c84-329c-4fc5-93d8-f08fc9385caf
2012-10-11 10:31:56,670 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-39) VM rest_vm_rhel e2a31e2c-4405-44c5-aa18-53601d3313fa moved from Up --> Reboot
InProgress
2012-10-11 10:32:25,589 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-60) VM rest_vm_rhel e2a31e2c-4405-44c5-aa18-53601d3313fa moved from RebootInProgr
ess --> Up
2012-10-11 10:33:22,116 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-67) XML RPC error in command GetCapabilitiesVDS ( HostName = puma39.scl.lab
.tlv.redhat.com ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-10-11 10:33:58,596 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-85) [7ccf1edf] could not stop spm of pool a9110c84-329c-4fc5-93d8-f08fc938
5caf on vds 3ab0a962-137b-11e2-8123-001a4a16976c - reason: org.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcRunTimeException: java.util.concurrent.TimeoutException
2012-10-11 10:33:58,597 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-85) [7ccf1edf] FINISH, SpmStopVDSCommand, log id: 71b7bf13
2012-10-11 10:33:58,597 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-85) [7ccf1edf] spm stop on spm failed, stopping spm selection!
2012-10-11 10:34:08,629 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-40) hostFromVds::selectedVds - puma40.scl.lab.tlv.redhat.com, spmStatus Fre
e, storage pool DefaultRest


Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 1 Oded Ramraz 2012-10-11 06:32:34 EDT
We encounter this issue quite often in RND integration environment as well .
Comment 2 Itamar Heim 2012-10-15 07:11:07 EDT
duplicate of bug 864157?
Comment 3 Eli Mesika 2012-10-17 05:38:01 EDT
http://gerrit.ovirt.org/#/c/8620/
Comment 7 Tareq Alayan 2012-11-05 03:11:18 EST
couldn't reproduce.
Comment 8 Eli Mesika 2012-12-05 14:18:27 EST
fixed in commit : 62b5a5b

Note You need to log in before you can comment on or make changes to this bug.