Created attachment 625476 [details] engine.log Restarting host fails occasionally (~20%) of the times I run through jenkins. The current logic as follows: (1) action = Stop (2) action=status (pulling 3 times) when we get status=OFF, we continue to (3). (3) action = Start (4) action = status (pulling 3 retries 60s between each retry) until we get status on Please consider to send an additional FenceVdsVDSCommand with action Start, it succeeds most of the times. The reason I think this is important is that with a turned-off host you cannot really do anything. Notes: ====== engine.log is attached. snapshot of the log regarding the logic: ========================================= 2012-10-11 10:28:06,314 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Done, log id: 7039f34c 2012-10-11 10:28:06,314 INFO [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Waiting for vds puma39.scl.lab.tlv.redhat.com to start 2012-10-11 10:28:11,315 INFO [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Attempt 1 to get vds puma39.scl.lab.tlv.redhat.com status 2012-10-11 10:28:11,350 INFO [org.ovirt.engine.core.bll.FencingExecutor] (ajp-/127.0.0.1:8702-1) Executing <Status> Power Management command, Proxy Host:puma40.scl.lab.tlv.redhat.com, Agent :ipmilan, Target Host:puma39.scl.lab.tlv.redhat.com, Management IP:puma39-mgmt.scl.lab.tlv.redhat.com, User:admin, Options: 2012-10-11 10:28:11,352 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) START, FenceVdsVDSCommand(HostName = puma40.scl.lab.tlv.redhat.com, HostI d = 3b8c8644-137b-11e2-a8f5-001a4a16976c, targetVdsId = 3ab0a962-137b-11e2-8123-001a4a16976c, action = Status, ip = puma39-mgmt.scl.lab.tlv.redhat.com, port = , type = ipmilan, user = admin, password = ******, options = ''), log id: 397dc723 2012-10-11 10:28:19,782 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: off, log id: 397dc723 2012-10-11 10:29:01,000 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-66) VM restvm_high 54630ded-6883-4938-aeef-7e7644455257 moved from PoweringUp --> Up 2012-10-11 10:29:01,009 INFO [org.ovirt.engine.core.vdsbroker.UpdateVdsDynamicDataVDSCommand] (QuartzScheduler_Worker-66) START, UpdateVdsDynamicDataVDSCommand(HostName = puma40.scl.lab.tlv .redhat.com, HostId = 3b8c8644-137b-11e2-a8f5-001a4a16976c, vdsDynamic=org.ovirt.engine.core.common.businessentities.VdsDynamic@8f644e69), log id: 521b710c 2012-10-11 10:29:01,012 INFO [org.ovirt.engine.core.vdsbroker.UpdateVdsDynamicDataVDSCommand] (QuartzScheduler_Worker-66) FINISH, UpdateVdsDynamicDataVDSCommand, log id: 521b710c 2012-10-11 10:29:19,783 INFO [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Attempt 2 to get vds puma39.scl.lab.tlv.redhat.com status 2012-10-11 10:29:19,818 INFO [org.ovirt.engine.core.bll.FencingExecutor] (ajp-/127.0.0.1:8702-1) Executing <Status> Power Management command, Proxy Host:puma40.scl.lab.tlv.redhat.com, Agent :ipmilan, Target Host:puma39.scl.lab.tlv.redhat.com, Management IP:puma39-mgmt.scl.lab.tlv.redhat.com, User:admin, Options: 2012-10-11 10:29:19,820 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) START, FenceVdsVDSCommand(HostName = puma40.scl.lab.tlv.redhat.com, HostId = 3b8c8644-137b-11e2-a8f5-001a4a16976c, targetVdsId = 3ab0a962-137b-11e2-8123-001a4a16976c, action = Status, ip = puma39-mgmt.scl.lab.tlv.redhat.com, port = , type = ipmilan, user = admin, password = ******, options = ''), log id: 596a6a2c 2012-10-11 10:29:19,926 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: off, log id: 596a6a2c 2012-10-11 10:30:00,001 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-39) Autorecovering hosts is disabled, skipping 2012-10-11 10:30:00,002 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-39) Autorecovering storage domains is disabled, skipping 2012-10-11 10:30:19,926 INFO [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Attempt 3 to get vds puma39.scl.lab.tlv.redhat.com status 2012-10-11 10:30:19,974 INFO [org.ovirt.engine.core.bll.FencingExecutor] (ajp-/127.0.0.1:8702-1) Executing <Status> Power Management command, Proxy Host:puma40.scl.lab.tlv.redhat.com, Agent:ipmilan, Target Host:puma39.scl.lab.tlv.redhat.com, Management IP:puma39-mgmt.scl.lab.tlv.redhat.com, User:admin, Options: 2012-10-11 10:30:19,976 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) START, FenceVdsVDSCommand(HostName = puma40.scl.lab.tlv.redhat.com, HostId = 3b8c8644-137b-11e2-a8f5-001a4a16976c, targetVdsId = 3ab0a962-137b-11e2-8123-001a4a16976c, action = Status, ip = puma39-mgmt.scl.lab.tlv.redhat.com, port = , type = ipmilan, user = admin, password = ******, options = ''), log id: d671cfd 2012-10-11 10:30:20,100 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: off, log id: d671cfd 2012-10-11 10:30:20,103 ERROR [org.ovirt.engine.core.bll.FenceVdsBaseCommand] (ajp-/127.0.0.1:8702-1) Failed to verify host puma39.scl.lab.tlv.redhat.com start status. Have retried 3 times with delay of 60 seconds between each retry. 2012-10-11 10:30:20,140 ERROR [org.ovirt.engine.core.bll.StartVdsCommand] (ajp-/127.0.0.1:8702-1) Transaction rolled-back for command: org.ovirt.engine.core.bll.StartVdsCommand. 2012-10-11 10:30:20,140 ERROR [org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand] (ajp-/127.0.0.1:8702-1) Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.FenceVdsManualyCommand. 2012-10-11 10:30:20,140 ERROR [org.ovirt.engine.core.bll.RestartVdsCommand] (ajp-/127.0.0.1:8702-1) Transaction rolled-back for command: org.ovirt.engine.core.bll.RestartVdsCommand. 2012-10-11 10:30:20,140 WARN [org.ovirt.engine.core.bll.RestartVdsCommand] (ajp-/127.0.0.1:8702-1) Restart host action failed, updating host 3ab0a962-137b-11e2-8123-001a4a16976c to NonResponsive 2012-10-11 10:30:20,154 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (ajp-/127.0.0.1:8702-1) START, SetVdsStatusVDSCommand(HostName = puma39.scl.lab.tlv.redhat.com, HostId = 3ab0a962-137b-11e2-8123-001a4a16976c, status=NonResponsive, nonOperationalReason=NONE), log id: 1d685284 2012-10-11 10:30:20,168 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (ajp-/127.0.0.1:8702-1) FINISH, SetVdsStatusVDSCommand, log id: 1d685284 2012-10-11 10:30:20,182 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-1) Operation Failed: [] 2012-10-11 10:30:58,595 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-85) [7ccf1edf] XML RPC error in command HSMGetAllTasksStatusesVDS ( HostNam e = puma39.scl.lab.tlv.redhat.com ), the error was: java.util.concurrent.TimeoutException, TimeoutException: 2012-10-11 10:30:58,596 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-85) [7ccf1edf] SpmStopVDSCommand::Stopping SPM on vds puma39.scl.lab.tlv.r edhat.com, pool id a9110c84-329c-4fc5-93d8-f08fc9385caf 2012-10-11 10:31:56,670 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-39) VM rest_vm_rhel e2a31e2c-4405-44c5-aa18-53601d3313fa moved from Up --> Reboot InProgress 2012-10-11 10:32:25,589 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-60) VM rest_vm_rhel e2a31e2c-4405-44c5-aa18-53601d3313fa moved from RebootInProgr ess --> Up 2012-10-11 10:33:22,116 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-67) XML RPC error in command GetCapabilitiesVDS ( HostName = puma39.scl.lab .tlv.redhat.com ), the error was: java.util.concurrent.TimeoutException, TimeoutException: 2012-10-11 10:33:58,596 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-85) [7ccf1edf] could not stop spm of pool a9110c84-329c-4fc5-93d8-f08fc938 5caf on vds 3ab0a962-137b-11e2-8123-001a4a16976c - reason: org.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcRunTimeException: java.util.concurrent.TimeoutException 2012-10-11 10:33:58,597 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (QuartzScheduler_Worker-85) [7ccf1edf] FINISH, SpmStopVDSCommand, log id: 71b7bf13 2012-10-11 10:33:58,597 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-85) [7ccf1edf] spm stop on spm failed, stopping spm selection! 2012-10-11 10:34:08,629 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-40) hostFromVds::selectedVds - puma40.scl.lab.tlv.redhat.com, spmStatus Fre e, storage pool DefaultRest Description of problem: Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
We encounter this issue quite often in RND integration environment as well .
duplicate of bug 864157?
http://gerrit.ovirt.org/#/c/8620/
https://gerrit.eng.lab.tlv.redhat.com/gitweb?p=ovirt-engine.git;a=commit;h=fabc7d60b2fc21228788948c7d20009d9ff76c72
couldn't reproduce.
fixed in commit : 62b5a5b