Bug 1223555 - Hosted Engine going continuously to EngineUpBadHealth-EngineDown, migrating to one node to the other
Summary: Hosted Engine going continuously to EngineUpBadHealth-EngineDown, migrating t...
Keywords:
Status: CLOSED DUPLICATE of bug 1150600
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-hosted-engine-ha
Version: 3.5
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 3.5.5
Assignee: Martin Sivák
QA Contact: meital avital
URL:
Whiteboard: sla
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-05-20 20:50 UTC by Mario Rial
Modified: 2016-02-10 19:41 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-05-26 16:22:43 UTC
oVirt Team: SLA
Embargoed:


Attachments (Terms of Use)
agent.log Host1 (copia.virt..) (2.08 MB, text/plain)
2015-05-25 15:52 UTC, Mario Rial
no flags Details
agent.log Host2 (victorcito.virt...) (2.04 MB, text/plain)
2015-05-25 15:52 UTC, Mario Rial
no flags Details
broker.log Host1 (copia.virt...) (9.19 MB, text/plain)
2015-05-25 15:54 UTC, Mario Rial
no flags Details
broker.log Host2 (victorcito.virt...) (9.44 MB, text/plain)
2015-05-25 15:55 UTC, Mario Rial
no flags Details
agent.log Host1 (copia.virt..) 1.2.6-2 (1.95 MB, text/plain)
2015-05-26 15:47 UTC, Mario Rial
no flags Details
broker.log Host2 (victorcito.virt...) 1.2.6-2 (1.98 MB, text/plain)
2015-05-26 15:48 UTC, Mario Rial
no flags Details
broker.log Host1 (copia.virt...) 1.2.6-2 (8.98 MB, text/plain)
2015-05-26 15:49 UTC, Mario Rial
no flags Details
agent.log Host2 (victorcito.virt...) 1.2.6-2 (1.98 MB, text/plain)
2015-05-26 15:51 UTC, Mario Rial
no flags Details

Description Mario Rial 2015-05-20 20:50:11 UTC
Hello,

In our installation of oVirt consisting in two nodes, with the Hosted Engine installed in both. The Hosted Engine is continuosly (about each hour) going from one node to the other.

I show the relevant part of agent.log:

MainThread::INFO::2015-05-20 22:04:02,204::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:02,205::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:04:14,721::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:14,722::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:04:26,464::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:26,464::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:04:38,220::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:38,220::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:04:50,247::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:50,247::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:05:00,329::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1432152300.33 type=state_transition detail=EngineUp-EngineUpBadHealth hostname='copia.virt.seguridadea1.com'
MainThread::INFO::2015-05-20 22:05:00,709::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineUpBadHealth) sent? sent
MainThread::INFO::2015-05-20 22:05:02,264::states::576::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to bad engine health at Wed May 20 22:05:02 2015
MainThread::INFO::2015-05-20 22:05:02,264::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 0)
MainThread::INFO::2015-05-20 22:05:02,264::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::ERROR::2015-05-20 22:05:12,318::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 300 seconds
MainThread::INFO::2015-05-20 22:05:12,320::states::576::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to bad engine health at Wed May 20 22:05:12 2015
MainThread::ERROR::2015-05-20 22:05:12,320::states::385::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host victorcito.virt.seguridadea1.com (id 2) score is significantly better than local score, shutting down VM on this host
MainThread::INFO::2015-05-20 22:05:12,336::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineUpBadHealth'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2015-05-20 22:05:12,354::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1432152312.35 type=state_transition detail=EngineUpBadHealth-EngineStop hostname='copia.virt.seguridadea1.com'
MainThread::INFO::2015-05-20 22:05:13,224::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUpBadHealth-EngineStop) sent? sent
MainThread::INFO::2015-05-20 22:05:15,240::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
MainThread::INFO::2015-05-20 22:05:15,241::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:05:25,287::hosted_engine::956::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2015-05-20 22:05:25,636::hosted_engine::961::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout: Machine shutting down

MainThread::INFO::2015-05-20 22:05:25,637::hosted_engine::962::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr: 
MainThread::ERROR::2015-05-20 22:05:25,637::hosted_engine::970::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost
MainThread::INFO::2015-05-20 22:05:25,657::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Wed May 20 22:10:25 2015 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2015-05-20 22:05:27,426::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
MainThread::INFO::2015-05-20 22:05:27,427::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:05:37,460::states::532::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Waiting on shutdown to complete (12 of 300 seconds)
MainThread::INFO::2015-05-20 22:05:39,164::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
MainThread::INFO::2015-05-20 22:05:39,164::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)



And the other relevant part of broker.log:

Thread-86306::INFO::2015-05-20 22:04:40,187::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86307::INFO::2015-05-20 22:04:40,191::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86307::INFO::2015-05-20 22:04:40,204::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-4::INFO::2015-05-20 22:04:40,344::mem_free::53::mem_free.MemFree::(action) memFree: 13582
Thread-6::INFO::2015-05-20 22:04:44,322::engine_health::110::engine_health.CpuLoadNoEngine::(action) VM is up on this host with healthy engine
Thread-2::INFO::2015-05-20 22:04:47,422::ping::52::ping.Ping::(action) Successfully pinged 10.7.1.254
Thread-86308::INFO::2015-05-20 22:04:49,860::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86308::INFO::2015-05-20 22:04:49,872::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86309::INFO::2015-05-20 22:04:49,875::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86309::INFO::2015-05-20 22:04:49,882::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86310::INFO::2015-05-20 22:04:49,883::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86310::INFO::2015-05-20 22:04:49,891::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86311::INFO::2015-05-20 22:04:50,411::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86311::INFO::2015-05-20 22:04:50,418::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86312::INFO::2015-05-20 22:04:50,419::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86312::INFO::2015-05-20 22:04:50,420::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86313::INFO::2015-05-20 22:04:50,421::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86313::INFO::2015-05-20 22:04:50,436::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-4::INFO::2015-05-20 22:04:50,456::mem_free::53::mem_free.MemFree::(action) memFree: 13583
Thread-6::WARNING::2015-05-20 22:04:54,734::engine_health::116::engine_health.CpuLoadNoEngine::(action) bad health status: Hosted Engine is not up!

Thread-86314::INFO::2015-05-20 22:05:00,543::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86314::INFO::2015-05-20 22:05:00,590::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86315::INFO::2015-05-20 22:05:00,591::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86316::INFO::2015-05-20 22:05:00,593::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86315::INFO::2015-05-20 22:05:00,595::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86316::INFO::2015-05-20 22:05:00,595::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-4::INFO::2015-05-20 22:05:00,622::mem_free::53::mem_free.MemFree::(action) memFree: 13585
Thread-86317::INFO::2015-05-20 22:05:05,590::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established


oVirt Release 3.5.2

The hosted engine is on an NFS data domain.

The remaining virtual machines are running perfectly on both nodes, and are on the same NFS server (different path obviously). The only thing I did to de Hosted Engine from the default installation is changing the network interface, but it seems that that's not the problem.


Additional info:

hosted-engine.conf:

fqdn=engine.virt.seguridadea1.com
vm_disk_id=320fc5bb-5b5d-4564-8d2f-9cce5584011f
vmid=f5db5eb7-6062-4e60-97f2-3eb73833eede
storage=192.168.20.202:/ovirtengine
conf=/etc/ovirt-hosted-engine/vm.conf
service_start_time=0
host_id=1
console=vnc
domainType=nfs3
spUUID=cff6faab-c9b0-4d9f-8b02-3cf2999be9c7
sdUUID=c7a9796d-34ea-4ea6-9197-1053d2ca00b1
connectionUUID=f3bddec8-6455-4013-82ca-d2fd628c02cb
ca_cert=/etc/pki/vdsm/libvirt-spice/ca-cert.pem
ca_subject="C=EN, L=Test, O=Test, CN=Test"
vdsm_use_ssl=true
gateway=10.7.1.254
bridge=serv_internos
metadata_volume_UUID=f51bd0ac-a879-45c3-a7ee-c42502339944
metadata_image_UUID=b4fd000c-6e9b-4a77-b1e8-eed7d2ba93c6
lockspace_volume_UUID=ab8b4015-a713-4b42-8b35-e90fedae4988
lockspace_image_UUID=5dc8a80f-5a8c-47e0-b2dc-4552b0eae5a6

vm.conf:

vmId=f5db5eb7-6062-4e60-97f2-3eb73833eede
memSize=4096
display=vnc
devices={index:2,iface:ide,address:{ controller:0, target:0,unit:0, bus:1, type:drive},specParams:{},readonly:true,deviceId:d620e110-17c2-4eee-b693-b3b52db8aa6c,path:/opt/CentOS-6.6-x86_64-minimal.iso,device:cdrom,shared:false,type:disk}
devices={index:0,iface:virtio,format:raw,poolID:00000000-0000-0000-0000-000000000000,volumeID:cc12a380-c61f-42ea-a123-a236b50b30a9,imageID:320fc5bb-5b5d-4564-8d2f-9cce5584011f,specParams:{},readonly:false,domainID:c7a9796d-34ea-4ea6-9197-1053d2ca00b1,optional:false,deviceId:320fc5bb-5b5d-4564-8d2f-9cce5584011f,address:{bus:0x00, slot:0x06, domain:0x0000, type:pci, function:0x0},device:disk,shared:exclusive,propagateErrors:off,type:disk,bootOrder:1}
devices={device:scsi,model:virtio-scsi,type:controller}
devices={nicModel:pv,macAddr:00:16:3e:7c:6b:0c,linkActive:true,network:serv_internos,filter:vdsm-no-mac-spoofing,specParams:{},deviceId:88eaaf56-af3d-4b80-811e-927f5e6c2d56,address:{bus:0x00, slot:0x03, domain:0x0000, type:pci, function:0x0},device:bridge,type:interface}
devices={device:console,specParams:{},type:console,deviceId:b6d27e83-71e7-4903-957d-47459ed23ad9,alias:console0}
vmName=HostedEngine
spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
smp=2
cpuType=Conroe
emulatedMachine=pc


Thanks.

Comment 1 Mario Rial 2015-05-20 20:56:15 UTC
I comfirm that the engine VM does not stop by it self. It is stoped from outside gracefully, if you are logged in it you can see something like:

[root@engine ~]# 
Broadcast message from root.seguridadea1.com
	(unknown) at 22:05 ...

The system is going down for halt in 2 minutes!
"VM is shutting down!" 

Broadcast message from root.seguridadea1.com
	(unknown) at 22:06 ...

The system is going down for halt in 1 minute!
"VM is shutting down!" 

Broadcast message from root.seguridadea1.com
	(unknown) at 22:07 ...

The system is going down for halt NOW!
"VM is shutting down!" 
Connection to 192.168.20.160 closed by remote host.
Connection to 192.168.20.160 closed.

------------


At 22:05 is the moment where it goes to BadHealth state and the broker.log says:

Thread-6::WARNING::2015-05-20 22:04:54,734::engine_health::116::engine_health.CpuLoadNoEngine::(action) bad health status: Hosted Engine is not up!


But it is UP!


Thanks for your help.

Regards.

Comment 2 Doron Fediuck 2015-05-25 07:30:29 UTC
Thanks for the report, Mario.
Can you please attach all relevant log files (with host names or ID)?

Comment 3 Martin Sivák 2015-05-25 09:15:23 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1150600

Please attach the logs or test with upgraded hosted engine package. There can be a periodic job that loads your host every hour and the status reports that engine could not be reached. The fix makes it obey the grace period (5 minutes iirc) so it has time to recover before actually forcing migration.

Comment 4 Mario Rial 2015-05-25 15:51:10 UTC
Hello,

I have read the other bug and I think is not a duplicate.

I did some research in the code of the broker and the problem seems to be very random.

I put some log code in check_liveliness.py, because I think that is the responsible of detect the Engine as down... I attach a greped log of every faulty response of the liveliness check of it in the file liveliness.py.

The logs have only the day of today (25/05/2015):

The service notify me by mail these states transitions:


13:26 - EngineUp-EngineUpBadHealth
13:26 - EngineUpBadHealth - EngineStop
13:29 - EngineStop - EngineDown
13:29 - EngineDown - EngineStart
13:29 - EngineStart - EngineStarting
13:29 - EngineStarting - EngineStop
13:42 - EngineStop - EngineDown


I put here the log of the calls of check_liveliness.py on Host1 (called copia.virt.... ):

2015-05-25 09:58:15.383008 : NOT RUNNING
2015-05-25 11:46:10.874021 : NOT RUNNING
2015-05-25 13:26:06.289082 : NOT RUNNING
2015-05-25 13:28:56.461911 : NOT RUNNING
2015-05-25 13:29:10.253072 : NOT RUNNING
2015-05-25 14:11:00.464599 : NOT RUNNING
2015-05-25 14:11:10.043068 : NOT RUNNING
2015-05-25 14:11:21.482542 : NOT RUNNING
2015-05-25 14:11:29.727406 : NOT RUNNING
2015-05-25 15:43:20.805004 : NOT RUNNING
2015-05-25 15:46:04.835796 : NOT RUNNING
2015-05-25 15:46:12.351592 : NOT RUNNING
2015-05-25 16:28:21.914958 : NOT RUNNING
2015-05-25 16:28:32.159975 : NOT RUNNING
2015-05-25 16:28:45.626845 : NOT RUNNING
2015-05-25 16:28:52.157795 : NOT RUNNING
2015-05-25 16:29:01.648473 : NOT RUNNING

I provide both agent.log and broker.log of the two hosts.

Regards.
Mario.

Comment 5 Mario Rial 2015-05-25 15:52:17 UTC
Created attachment 1029507 [details]
agent.log Host1 (copia.virt..)

Comment 6 Mario Rial 2015-05-25 15:52:59 UTC
Created attachment 1029509 [details]
agent.log Host2 (victorcito.virt...)

Comment 7 Mario Rial 2015-05-25 15:54:45 UTC
Created attachment 1029522 [details]
broker.log Host1 (copia.virt...)

Comment 8 Mario Rial 2015-05-25 15:55:47 UTC
Created attachment 1029523 [details]
broker.log Host2 (victorcito.virt...)

Comment 9 Mario Rial 2015-05-26 14:23:25 UTC
Hello again,

I have upgraded the ovirt-hosted-engine-ha to the last version on GitHub (1.2.6-2) and the problem persists.

Regards.

Mario.

Comment 10 Martin Sivák 2015-05-26 14:47:55 UTC
Can you give us the logs from the upgraded system? The logs #5-#8 are from the older version (score is reduced to 0 when in BadHealth state). We removed the score penalty in 1.2.6-2 so it should have a different reason now.

Comment 11 Mario Rial 2015-05-26 15:46:44 UTC
Hello Martin,

I did some research on the logs, now (for the moment) the Engine only goes from EngineUp to EngineUpBadHealth.

I show you the state transitions of today:

08:56 - EngineUp-EngineUpBadHealth
08:56 - EngineUpBadHealth-EngineUp

09:25 - EngineUp-EngineUpBadHealth
09:25 - EngineUpBadHealth-EngineUp

09:30 - EngineUp-EngineUpBadHealth
09:30 - EngineUpBadHealth-Engineup

15:06 - EngineUp-EngineUpBadHealth
15:06 - EngineUpBadHealth-EngineUp

And this is the logs of the failed calls to check_liveliness.py:

2015-05-26 08:56:15.849889 : ENGINE DOWN
2015-05-26 09:25:06.897624 : ENGINE DOWN
2015-05-26 09:30:08.378101 : ENGINE DOWN
2015-05-26 09:45:18.621886 : ENGINE DOWN
2015-05-26 15:06:25.075821 : ENGINE DOWN


We saw that now the engine don't goes down, but now the problem I think that is that the check_liveliness should be more permisive and give more time before telling that the engine is down. Because 2 or 3 seconds after telling it's down tell it's up again.

I also attach the logs, the host1 (copia.virt...) is the host with the engine running.

I have one question, It's correct to migrate the engine throught the Web Admin interface??

Thank you.

Regards.

Comment 12 Mario Rial 2015-05-26 15:47:41 UTC
Created attachment 1030011 [details]
agent.log Host1 (copia.virt..) 1.2.6-2

Comment 13 Mario Rial 2015-05-26 15:48:18 UTC
Created attachment 1030012 [details]
broker.log Host2 (victorcito.virt...) 1.2.6-2

Comment 14 Mario Rial 2015-05-26 15:49:13 UTC
Created attachment 1030013 [details]
broker.log Host1 (copia.virt...) 1.2.6-2

Comment 15 Mario Rial 2015-05-26 15:51:08 UTC
Created attachment 1030016 [details]
agent.log Host2 (victorcito.virt...) 1.2.6-2

Comment 16 Martin Sivák 2015-05-26 16:09:20 UTC
Hi, thanks for the logs. I do not see any migration or start/stop logs in what you just attached to the bugzilla. Are you sure your engine still migrates around?

I see the VM is:
- running on Host1
- going to EngineUpBadHealth couple of times
- staying there for about 10-20 seconds
- and then returning to EngineUp

Which is exactly what is supposed to happen when there is an issue with retrieving the status page due to CPU load (for example).

Comment 17 Mario Rial 2015-05-26 16:18:06 UTC
Hi, Martin.

Yes after looking at the logs, now seems that at this version (1.2.6-2) the problem of migration seems to be solved.

Maybe the bug can be marked as solved.

Thanks for your help.

Regards.

Comment 18 Martin Sivák 2015-05-26 16:22:43 UTC
I missed part of your comment…

What you see is indeed the permissiveness we have in the state machine. The stats collector reports that the engine is down and the agent enters the EngineUpBadHealth state. There is an timeout there that waits for the engine to recover (for 5 minutes). If it recovers, it moves back to up and the agent treats it as if nothing bad happened.

And to answer your question:

Yes, migrating the engine from the web interface should work. As should maintenance mode for a Hosted Engine host.

I will close this as a duplicate to the bug I referenced earlier (because it was cause by the same issue although it does not look like that). Glad to be able to help.

*** This bug has been marked as a duplicate of bug 1150600 ***


Note You need to log in before you can comment on or make changes to this bug.