Created attachment 1350333 [details]
Agent and Broker logs

We lost the backend storage that hosts our self-hosted engine tonight. We've recovered it, and there was no data corruption on the volume containing the HE disk. However, when we try to start the HE it doesn't give an error, but it also doesn't start. The VM isn't pingable and the liveliness check always fails.

[root@ovirttest1 ~]# hosted-engine --vm-status | grep -A20 ovirttest1
Hostname                           : ovirttest1.wolfram.com
Host ID                            : 1
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 2c2f3ec9
local_conf_timestamp               : 18980042
Host timestamp                     : 18980039
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=18980039 (Fri Nov 10 01:17:59 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=18980042 (Fri Nov 10 01:18:03 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=GlobalMaintenance
    stopped=False

The environment is in Global Maintenance so that we can restrict it to starting on a specific host, to eliminate as many variables as possible. I've attached the agent and broker logs.
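For quick triage, the "Engine status" field above is a JSON blob, so the relevant bits can be pulled out with a few lines of script. A minimal sketch (this helper is mine, not part of oVirt's tooling):

```python
import json
import re

def engine_health(vm_status_text):
    """Extract the JSON from an 'Engine status' line of hosted-engine --vm-status.

    Returns a dict such as {'reason': ..., 'health': ..., 'vm': ..., 'detail': ...},
    or None if no status line is found.
    """
    m = re.search(r"Engine status\s*:\s*(\{.*?\})", vm_status_text)
    if not m:
        return None
    return json.loads(m.group(1))

# Example using the output shown in this report:
status = engine_health(
    'Engine status : {"reason": "failed liveliness check", '
    '"health": "bad", "vm": "up", "detail": "up"}'
)
# status["vm"] is "up" while status["health"] is "bad": the VM process is
# running, but the engine inside it is not answering the liveliness check.
```

The vm=up/health=bad combination is exactly the symptom described here: qemu is alive, but the engine service inside the guest never becomes reachable.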
It seems that the daemons were trying to start the HE VM, but failed:

Agent log:
Thread-939::INFO::2017-11-10 00:11:45,426::engine_health::80::engine_health.CpuLoadNoEngine::(action) VM status: powering up

Broker log:
MainThread::INFO::2017-11-10 00:12:14,879::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'up', 'detail': 'powering up'}, 'bridge': True, 'mem-free': 193754.0, 'maintenance': False, 'cpu-load': 0.0647, 'gateway': 1.0}

Can you please provide the vdsm log from the relevant time?
Created attachment 1351211 [details]
vdsm log

We managed to resurrect it last night; this is the log from shortly after the problem began. To recover it, we got a console on the VM, booted it from a live CD, and fixed the filesystem.

Logan
So, from the logs we can see the VM process started and was running until killed, probably by the admin:

VM started:
----------------
2017-11-10 01:09:28,010-0600 INFO (libvirt/events) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') CPU running: onResume (vm:5093)
2017-11-10 01:09:28,030-0600 INFO (vm/32116ed2) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Starting connection (guestagent:245)
2017-11-10 01:09:28,032-0600 INFO (vm/32116ed2) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') CPU running: domain initialization (vm:5093)

Killing the VM:
----------------
2017-11-10 00:10:50,654-0600 INFO (jsonrpc/5) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Release VM resources (vm:4298)
2017-11-10 00:10:50,665-0600 INFO (jsonrpc/5) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Stopping connection (guestagent:430)
2017-11-10 00:10:50,665-0600 INFO (jsonrpc/5) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') _destroyVmGraceful attempt #0 (vm:4334)
2017-11-10 00:10:50,979-0600 INFO (libvirt/events) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') underlying process disconnected (vm:694)
2017-11-10 00:10:50,980-0600 INFO (jsonrpc/5) [vdsm.api] START teardownImage(sdUUID='af785b15-3e7a-4775-a28e-6f9da8eb154d', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='405449c7-09dc-4dc7-9eee-1bbd6be42eb6', volUUID=None) from=::1,36802, task_id=09ac8c5d-8c2f-4ad2-89b6-d5a1c4e2dbe2 (api:46)
2017-11-10 00:10:50,981-0600 INFO (jsonrpc/5) [vdsm.api] FINISH teardownImage return=None from=::1,36802, task_id=09ac8c5d-8c2f-4ad2-89b6-d5a1c4e2dbe2 (api:52)
2017-11-10 00:10:50,981-0600 INFO (jsonrpc/5) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Stopping connection (guestagent:430)
2017-11-10 00:10:50,981-0600 WARN (jsonrpc/5) [root] File: /var/lib/libvirt/qemu/channels/32116ed2-da80-4e03-8f39-6435339ba674.com.redhat.rhevm.vdsm already removed (utils:120)
2017-11-10 00:10:50,982-0600 WARN (jsonrpc/5) [root] File: /var/lib/libvirt/qemu/channels/32116ed2-da80-4e03-8f39-6435339ba674.org.qemu.guest_agent.0 already removed (utils:120)
2017-11-10 00:10:50,982-0600 INFO (jsonrpc/5) [vdsm.api] START inappropriateDevices(thiefId='32116ed2-da80-4e03-8f39-6435339ba674') from=::1,36802, task_id=6ab54934-cdd1-48d4-92c6-0262b30aab49 (api:46)
2017-11-10 00:10:50,984-0600 INFO (jsonrpc/5) [vdsm.api] FINISH inappropriateDevices return=None from=::1,36802, task_id=6ab54934-cdd1-48d4-92c6-0262b30aab49 (api:52)
2017-11-10 00:10:50,985-0600 INFO (libvirt/events) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Changed state to Down: Admin shut down from the engine (code=6) (vm:1259)
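Reconstructing a timeline like the one above means filtering vdsm.log for one vmId's lifecycle events. A minimal sketch of how that filtering can be scripted (my own helper, not a vdsm tool; the log line layout assumed here matches the excerpts in this report):

```python
import re

# Matches vdsm.log lines that carry a vmId, e.g.:
# 2017-11-10 01:09:28,010-0600 INFO (libvirt/events) [virt.vm] (vmId='...') CPU running: onResume (vm:5093)
# Lines without a vmId (e.g. [vdsm.api] START/FINISH entries) simply won't match.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{4}) "
    r"(?P<level>\w+)\s+\(.*?\) \[(?P<logger>[^\]]+)\] "
    r"\(vmId='(?P<vmid>[0-9a-f-]+)'\) (?P<msg>.*)$"
)

def vm_events(log_lines, vmid):
    """Return (timestamp, message) pairs for one VM, in log order."""
    events = []
    for line in log_lines:
        m = LINE_RE.match(line)
        if m and m.group("vmid") == vmid:
            events.append((m.group("ts"), m.group("msg")))
    return events

sample = [
    "2017-11-10 01:09:28,010-0600 INFO  (libvirt/events) [virt.vm] "
    "(vmId='32116ed2-da80-4e03-8f39-6435339ba674') CPU running: onResume (vm:5093)",
]
# vm_events(sample, '32116ed2-da80-4e03-8f39-6435339ba674')
# -> [('2017-11-10 01:09:28,010-0600', 'CPU running: onResume (vm:5093)')]
```

Sorting the resulting pairs by timestamp makes the ordering problem in the excerpt obvious: the "Killing the VM" entries (00:10) actually precede the "VM started" entries (01:09).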
Anything else we can do here?
No, but thank you for your time.