Hello. Our oVirt installation consists of two nodes, with the Hosted Engine deployed on both. The Hosted Engine is continuously (roughly once an hour) moving from one node to the other. Here is the relevant part of agent.log:

MainThread::INFO::2015-05-20 22:04:02,204::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:02,205::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:04:14,721::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:14,722::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:04:26,464::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:26,464::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:04:38,220::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:38,220::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:04:50,247::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-05-20 22:04:50,247::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:05:00,329::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1432152300.33 type=state_transition detail=EngineUp-EngineUpBadHealth hostname='copia.virt.seguridadea1.com'
MainThread::INFO::2015-05-20 22:05:00,709::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineUpBadHealth) sent? sent
MainThread::INFO::2015-05-20 22:05:02,264::states::576::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to bad engine health at Wed May 20 22:05:02 2015
MainThread::INFO::2015-05-20 22:05:02,264::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 0)
MainThread::INFO::2015-05-20 22:05:02,264::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::ERROR::2015-05-20 22:05:12,318::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 300 seconds
MainThread::INFO::2015-05-20 22:05:12,320::states::576::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to bad engine health at Wed May 20 22:05:12 2015
MainThread::ERROR::2015-05-20 22:05:12,320::states::385::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host victorcito.virt.seguridadea1.com (id 2) score is significantly better than local score, shutting down VM on this host
MainThread::INFO::2015-05-20 22:05:12,336::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineUpBadHealth'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2015-05-20 22:05:12,354::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1432152312.35 type=state_transition detail=EngineUpBadHealth-EngineStop hostname='copia.virt.seguridadea1.com'
MainThread::INFO::2015-05-20 22:05:13,224::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUpBadHealth-EngineStop) sent? sent
MainThread::INFO::2015-05-20 22:05:15,240::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
MainThread::INFO::2015-05-20 22:05:15,241::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:05:25,287::hosted_engine::956::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2015-05-20 22:05:25,636::hosted_engine::961::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout: Machine shutting down
MainThread::INFO::2015-05-20 22:05:25,637::hosted_engine::962::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr:
MainThread::ERROR::2015-05-20 22:05:25,637::hosted_engine::970::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost
MainThread::INFO::2015-05-20 22:05:25,657::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Wed May 20 22:10:25 2015 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2015-05-20 22:05:27,426::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
MainThread::INFO::2015-05-20 22:05:27,427::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)
MainThread::INFO::2015-05-20 22:05:37,460::states::532::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Waiting on shutdown to complete (12 of 300 seconds)
MainThread::INFO::2015-05-20 22:05:39,164::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
MainThread::INFO::2015-05-20 22:05:39,164::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host victorcito.virt.seguridadea1.com (id: 2, score: 2400)

And the relevant part of broker.log:

Thread-86306::INFO::2015-05-20 22:04:40,187::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86307::INFO::2015-05-20 22:04:40,191::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86307::INFO::2015-05-20 22:04:40,204::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-4::INFO::2015-05-20 22:04:40,344::mem_free::53::mem_free.MemFree::(action) memFree: 13582
Thread-6::INFO::2015-05-20 22:04:44,322::engine_health::110::engine_health.CpuLoadNoEngine::(action) VM is up on this host with healthy engine
Thread-2::INFO::2015-05-20 22:04:47,422::ping::52::ping.Ping::(action) Successfully pinged 10.7.1.254
Thread-86308::INFO::2015-05-20 22:04:49,860::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86308::INFO::2015-05-20 22:04:49,872::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86309::INFO::2015-05-20 22:04:49,875::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86309::INFO::2015-05-20 22:04:49,882::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86310::INFO::2015-05-20 22:04:49,883::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86310::INFO::2015-05-20 22:04:49,891::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86311::INFO::2015-05-20 22:04:50,411::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86311::INFO::2015-05-20 22:04:50,418::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86312::INFO::2015-05-20 22:04:50,419::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86312::INFO::2015-05-20 22:04:50,420::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86313::INFO::2015-05-20 22:04:50,421::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86313::INFO::2015-05-20 22:04:50,436::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-4::INFO::2015-05-20 22:04:50,456::mem_free::53::mem_free.MemFree::(action) memFree: 13583
Thread-6::WARNING::2015-05-20 22:04:54,734::engine_health::116::engine_health.CpuLoadNoEngine::(action) bad health status: Hosted Engine is not up!
Thread-86314::INFO::2015-05-20 22:05:00,543::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86314::INFO::2015-05-20 22:05:00,590::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86315::INFO::2015-05-20 22:05:00,591::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86316::INFO::2015-05-20 22:05:00,593::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-86315::INFO::2015-05-20 22:05:00,595::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-86316::INFO::2015-05-20 22:05:00,595::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-4::INFO::2015-05-20 22:05:00,622::mem_free::53::mem_free.MemFree::(action) memFree: 13585
Thread-86317::INFO::2015-05-20 22:05:05,590::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established

oVirt Release 3.5.2. The hosted engine is on an NFS data domain. The remaining virtual machines run perfectly on both nodes and live on the same NFS server (on a different path, obviously). The only change I made to the Hosted Engine from the default installation was the network interface, but that does not seem to be the problem.
Additional info:

hosted-engine.conf:

fqdn=engine.virt.seguridadea1.com
vm_disk_id=320fc5bb-5b5d-4564-8d2f-9cce5584011f
vmid=f5db5eb7-6062-4e60-97f2-3eb73833eede
storage=192.168.20.202:/ovirtengine
conf=/etc/ovirt-hosted-engine/vm.conf
service_start_time=0
host_id=1
console=vnc
domainType=nfs3
spUUID=cff6faab-c9b0-4d9f-8b02-3cf2999be9c7
sdUUID=c7a9796d-34ea-4ea6-9197-1053d2ca00b1
connectionUUID=f3bddec8-6455-4013-82ca-d2fd628c02cb
ca_cert=/etc/pki/vdsm/libvirt-spice/ca-cert.pem
ca_subject="C=EN, L=Test, O=Test, CN=Test"
vdsm_use_ssl=true
gateway=10.7.1.254
bridge=serv_internos
metadata_volume_UUID=f51bd0ac-a879-45c3-a7ee-c42502339944
metadata_image_UUID=b4fd000c-6e9b-4a77-b1e8-eed7d2ba93c6
lockspace_volume_UUID=ab8b4015-a713-4b42-8b35-e90fedae4988
lockspace_image_UUID=5dc8a80f-5a8c-47e0-b2dc-4552b0eae5a6

vm.conf:

vmId=f5db5eb7-6062-4e60-97f2-3eb73833eede
memSize=4096
display=vnc
devices={index:2,iface:ide,address:{ controller:0, target:0,unit:0, bus:1, type:drive},specParams:{},readonly:true,deviceId:d620e110-17c2-4eee-b693-b3b52db8aa6c,path:/opt/CentOS-6.6-x86_64-minimal.iso,device:cdrom,shared:false,type:disk}
devices={index:0,iface:virtio,format:raw,poolID:00000000-0000-0000-0000-000000000000,volumeID:cc12a380-c61f-42ea-a123-a236b50b30a9,imageID:320fc5bb-5b5d-4564-8d2f-9cce5584011f,specParams:{},readonly:false,domainID:c7a9796d-34ea-4ea6-9197-1053d2ca00b1,optional:false,deviceId:320fc5bb-5b5d-4564-8d2f-9cce5584011f,address:{bus:0x00, slot:0x06, domain:0x0000, type:pci, function:0x0},device:disk,shared:exclusive,propagateErrors:off,type:disk,bootOrder:1}
devices={device:scsi,model:virtio-scsi,type:controller}
devices={nicModel:pv,macAddr:00:16:3e:7c:6b:0c,linkActive:true,network:serv_internos,filter:vdsm-no-mac-spoofing,specParams:{},deviceId:88eaaf56-af3d-4b80-811e-927f5e6c2d56,address:{bus:0x00, slot:0x03, domain:0x0000, type:pci, function:0x0},device:bridge,type:interface}
devices={device:console,specParams:{},type:console,deviceId:b6d27e83-71e7-4903-957d-47459ed23ad9,alias:console0}
vmName=HostedEngine
spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
smp=2
cpuType=Conroe
emulatedMachine=pc

Thanks.
I confirm that the engine VM does not stop by itself. It is stopped gracefully from outside; if you are logged into it, you see something like:

[root@engine ~]#
Broadcast message from root.seguridadea1.com (unknown) at 22:05 ...
The system is going down for halt in 2 minutes! "VM is shutting down!"
Broadcast message from root.seguridadea1.com (unknown) at 22:06 ...
The system is going down for halt in 1 minute! "VM is shutting down!"
Broadcast message from root.seguridadea1.com (unknown) at 22:07 ...
The system is going down for halt NOW! "VM is shutting down!"
Connection to 192.168.20.160 closed by remote host.
Connection to 192.168.20.160 closed.
------------
22:05 is the moment it goes to the BadHealth state, and broker.log says:

Thread-6::WARNING::2015-05-20 22:04:54,734::engine_health::116::engine_health.CpuLoadNoEngine::(action) bad health status: Hosted Engine is not up!

But it IS up! Thanks for your help. Regards.
Thanks for the report, Mario. Can you please attach all relevant log files (with host names or IDs)?
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1150600 Please attach the logs, or test with the upgraded hosted-engine package. There may be a periodic job that loads your host every hour, causing the status check to report that the engine could not be reached. The fix makes the agent obey the grace period (5 minutes, iirc) so the engine has time to recover before a migration is actually forced.
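To illustrate, the grace-period behavior amounts to something like this simplified Python model (the state names match the agent; everything else here is illustrative, not the actual agent code):

```python
GRACE_PERIOD = 300.0  # seconds the agent tolerates bad health (5 minutes)

def next_state(state, engine_healthy, bad_since, now):
    """Simplified model of the EngineUp / EngineUpBadHealth transitions:
    bad health escalates to EngineStop only after the grace period expires.
    Returns the new state and the timestamp when bad health began (or None)."""
    if engine_healthy:
        return "EngineUp", None                  # any recovery clears the timeout
    if state == "EngineUp":
        return "EngineUpBadHealth", now          # first failure: start the timeout
    if state == "EngineUpBadHealth":
        if now - bad_since >= GRACE_PERIOD:
            return "EngineStop", None            # grace period expired, stop the VM
        return "EngineUpBadHealth", bad_since    # keep waiting for recovery
    return state, bad_since
```

So a transient failure that clears within the 5-minute window bounces the agent back to EngineUp without ever shutting the VM down.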
Hello, I have read the other bug and I think this is not a duplicate. I did some research in the broker code, and the problem seems to be very random. I added some logging code to check_liveliness.py, because I think that is what is responsible for detecting the Engine as down... I attach a grepped log of every faulty response of the liveliness check in the file liveliness.py. The logs cover only today (25/05/2015). The service notified me by mail of these state transitions:

13:26 - EngineUp-EngineUpBadHealth
13:26 - EngineUpBadHealth-EngineStop
13:29 - EngineStop-EngineDown
13:29 - EngineDown-EngineStart
13:29 - EngineStart-EngineStarting
13:29 - EngineStarting-EngineStop
13:42 - EngineStop-EngineDown

Here is the log of the failed calls to check_liveliness.py on Host1 (called copia.virt....):

2015-05-25 09:58:15.383008 : NOT RUNNING
2015-05-25 11:46:10.874021 : NOT RUNNING
2015-05-25 13:26:06.289082 : NOT RUNNING
2015-05-25 13:28:56.461911 : NOT RUNNING
2015-05-25 13:29:10.253072 : NOT RUNNING
2015-05-25 14:11:00.464599 : NOT RUNNING
2015-05-25 14:11:10.043068 : NOT RUNNING
2015-05-25 14:11:21.482542 : NOT RUNNING
2015-05-25 14:11:29.727406 : NOT RUNNING
2015-05-25 15:43:20.805004 : NOT RUNNING
2015-05-25 15:46:04.835796 : NOT RUNNING
2015-05-25 15:46:12.351592 : NOT RUNNING
2015-05-25 16:28:21.914958 : NOT RUNNING
2015-05-25 16:28:32.159975 : NOT RUNNING
2015-05-25 16:28:45.626845 : NOT RUNNING
2015-05-25 16:28:52.157795 : NOT RUNNING
2015-05-25 16:29:01.648473 : NOT RUNNING

I provide both agent.log and broker.log for the two hosts. Regards. Mario.
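(The logging I added is essentially a timestamped append on every probe result — roughly this sketch; the function name and file argument are mine, not from the real module:)

```python
import datetime

def log_liveliness_result(logfile, status):
    """Append a timestamped line for a liveliness probe result
    (e.g. status = "NOT RUNNING"), matching the format shown above."""
    with open(logfile, "a") as f:
        f.write("%s : %s\n" % (datetime.datetime.now(), status))
```

Failed probes can then be grepped out of that file, which is how the list above was produced.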
Created attachment 1029507 [details] agent.log Host1 (copia.virt..)
Created attachment 1029509 [details] agent.log Host2 (victorcito.virt...)
Created attachment 1029522 [details] broker.log Host1 (copia.virt...)
Created attachment 1029523 [details] broker.log Host2 (victorcito.virt...)
Hello again, I have upgraded ovirt-hosted-engine-ha to the latest version on GitHub (1.2.6-2) and the problem persists. Regards. Mario.
Can you give us the logs from the upgraded system? The logs in #5-#8 are from the older version (the score is reduced to 0 while in the BadHealth state). We removed the score penalty in 1.2.6-2, so there should now be a different reason.
Hello Martin, I did some research on the logs; now (for the moment) the Engine only goes from EngineUp to EngineUpBadHealth. These are the state transitions from today:

08:56 - EngineUp-EngineUpBadHealth
08:56 - EngineUpBadHealth-EngineUp
09:25 - EngineUp-EngineUpBadHealth
09:25 - EngineUpBadHealth-EngineUp
09:30 - EngineUp-EngineUpBadHealth
09:30 - EngineUpBadHealth-EngineUp
15:06 - EngineUp-EngineUpBadHealth
15:06 - EngineUpBadHealth-EngineUp

And these are the logs of the failed calls to check_liveliness.py:

2015-05-26 08:56:15.849889 : ENGINE DOWN
2015-05-26 09:25:06.897624 : ENGINE DOWN
2015-05-26 09:30:08.378101 : ENGINE DOWN
2015-05-26 09:45:18.621886 : ENGINE DOWN
2015-05-26 15:06:25.075821 : ENGINE DOWN

We see that the engine no longer goes down, but I think the remaining problem is that check_liveliness should be more permissive and allow more time before declaring the engine down, because 2 or 3 seconds after reporting it down it reports it up again. I also attach the logs; host1 (copia.virt...) is the host running the engine. One question: is it correct to migrate the engine through the Web Admin interface? Thank you. Regards.
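(By "more permissive" I mean something along these lines — an illustrative sketch, not the actual ovirt-hosted-engine-ha code:)

```python
import time

def check_with_grace(probe, retries=3, delay=2.0):
    """Declare the engine down only after several consecutive failed
    liveliness probes, so a single slow response does not immediately
    trigger a state transition."""
    for attempt in range(retries):
        if probe():  # one liveliness probe, e.g. a request to the health page
            return True
        if attempt < retries - 1:
            time.sleep(delay)  # give the engine a moment to recover
    return False
```

With retries=3 and delay=2.0, a blip of a few seconds would be absorbed instead of being reported as ENGINE DOWN.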
Created attachment 1030011 [details] agent.log Host1 (copia.virt..) 1.2.6-2
Created attachment 1030012 [details] broker.log Host2 (victorcito.virt...) 1.2.6-2
Created attachment 1030013 [details] broker.log Host1 (copia.virt...) 1.2.6-2
Created attachment 1030016 [details] agent.log Host2 (victorcito.virt...) 1.2.6-2
Hi, thanks for the logs. I do not see any migration or start/stop entries in what you just attached to the bugzilla. Are you sure your engine still migrates around? I see the VM is:
- running on Host1
- going to EngineUpBadHealth a couple of times
- staying there for about 10-20 seconds
- and then returning to EngineUp
Which is exactly what is supposed to happen when there is an issue retrieving the status page, due to CPU load for example.
Hi Martin. Yes, after looking at the logs, it seems that in this version (1.2.6-2) the migration problem is solved. Maybe the bug can be marked as resolved. Thanks for your help. Regards.
I missed part of your comment… What you see is indeed the permissiveness we have in the state machine. The stats collector reports that the engine is down and the agent enters the EngineUpBadHealth state. There is a timeout there that waits (for 5 minutes) for the engine to recover. If it recovers, the agent moves back to EngineUp and treats it as if nothing bad happened. And to answer your question: yes, migrating the engine from the web interface should work, as should maintenance mode for a Hosted Engine host. I will close this as a duplicate of the bug I referenced earlier (it was caused by the same issue, although it does not look like it). Glad to be able to help.

*** This bug has been marked as a duplicate of bug 1150600 ***