Created attachment 1350333 [details]
Agent and Broker logs

We lost the backend storage that hosts our self-hosted engine tonight. We've recovered it, and there was no data corruption on the volume containing the HE disk. However, when we try to start the HE it doesn't give an error, but it also doesn't start. The VM isn't pingable and the liveliness check always fails.

[root@ovirttest1 ~]# hosted-engine --vm-status | grep -A20 ovirttest1
Hostname                           : ovirttest1.wolfram.com
Host ID                            : 1
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 2c2f3ec9
local_conf_timestamp               : 18980042
Host timestamp                     : 18980039
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=18980039 (Fri Nov 10 01:17:59 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=18980042 (Fri Nov 10 01:18:03 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=GlobalMaintenance
    stopped=False

The environment is in Global Maintenance so that we can restrict it to starting on a specific host, to eliminate as many variables as possible. I've attached the agent and broker logs.
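For quick triage, the "Engine status" field above is a JSON blob, so the relevant bits can be pulled out with a few lines of script. A minimal sketch (this helper is mine, not part of oVirt's tooling):

```python
import json
import re

def engine_health(vm_status_text):
    """Extract the JSON from an 'Engine status' line of hosted-engine --vm-status.

    Returns a dict such as {'reason': ..., 'health': ..., 'vm': ..., 'detail': ...},
    or None if no status line is found.
    """
    m = re.search(r"Engine status\s*:\s*(\{.*?\})", vm_status_text)
    if not m:
        return None
    return json.loads(m.group(1))

# Example using the output shown in this report:
status = engine_health(
    'Engine status : {"reason": "failed liveliness check", '
    '"health": "bad", "vm": "up", "detail": "up"}'
)
# status["vm"] is "up" while status["health"] is "bad": the VM process is
# running, but the engine inside it is not answering the liveliness check.
```

The vm=up/health=bad combination is exactly the symptom described here: qemu is alive, but the engine service inside the guest never becomes reachable.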
It seems that the daemons were trying to start the HE VM, but failed:

Agent log:
Thread-939::INFO::2017-11-10 00:11:45,426::engine_health::80::engine_health.CpuLoadNoEngine::(action) VM status: powering up

Broker log:
MainThread::INFO::2017-11-10 00:12:14,879::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'up', 'detail': 'powering up'}, 'bridge': True, 'mem-free': 193754.0, 'maintenance': False, 'cpu-load': 0.0647, 'gateway': 1.0}

Can you please provide the vdsm log from the relevant time?
Created attachment 1351211 [details]
vdsm log

We managed to resurrect it last night; this is the log from shortly after the problem began. To recover it, we got a console on the VM, booted it from a live CD, and fixed the filesystem.

Logan
So, from the logs we can see the VM process started and was running until killed, probably by the admin:

VM started:
----------------
2017-11-10 01:09:28,010-0600 INFO (libvirt/events) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') CPU running: onResume (vm:5093)
2017-11-10 01:09:28,030-0600 INFO (vm/32116ed2) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Starting connection (guestagent:245)
2017-11-10 01:09:28,032-0600 INFO (vm/32116ed2) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') CPU running: domain initialization (vm:5093)

Killing the VM:
----------------
2017-11-10 00:10:50,654-0600 INFO (jsonrpc/5) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Release VM resources (vm:4298)
2017-11-10 00:10:50,665-0600 INFO (jsonrpc/5) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Stopping connection (guestagent:430)
2017-11-10 00:10:50,665-0600 INFO (jsonrpc/5) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') _destroyVmGraceful attempt #0 (vm:4334)
2017-11-10 00:10:50,979-0600 INFO (libvirt/events) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') underlying process disconnected (vm:694)
2017-11-10 00:10:50,980-0600 INFO (jsonrpc/5) [vdsm.api] START teardownImage(sdUUID='af785b15-3e7a-4775-a28e-6f9da8eb154d', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='405449c7-09dc-4dc7-9eee-1bbd6be42eb6', volUUID=None) from=::1,36802, task_id=09ac8c5d-8c2f-4ad2-89b6-d5a1c4e2dbe2 (api:46)
2017-11-10 00:10:50,981-0600 INFO (jsonrpc/5) [vdsm.api] FINISH teardownImage return=None from=::1,36802, task_id=09ac8c5d-8c2f-4ad2-89b6-d5a1c4e2dbe2 (api:52)
2017-11-10 00:10:50,981-0600 INFO (jsonrpc/5) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Stopping connection (guestagent:430)
2017-11-10 00:10:50,981-0600 WARN (jsonrpc/5) [root] File: /var/lib/libvirt/qemu/channels/32116ed2-da80-4e03-8f39-6435339ba674.com.redhat.rhevm.vdsm already removed (utils:120)
2017-11-10 00:10:50,982-0600 WARN (jsonrpc/5) [root] File: /var/lib/libvirt/qemu/channels/32116ed2-da80-4e03-8f39-6435339ba674.org.qemu.guest_agent.0 already removed (utils:120)
2017-11-10 00:10:50,982-0600 INFO (jsonrpc/5) [vdsm.api] START inappropriateDevices(thiefId='32116ed2-da80-4e03-8f39-6435339ba674') from=::1,36802, task_id=6ab54934-cdd1-48d4-92c6-0262b30aab49 (api:46)
2017-11-10 00:10:50,984-0600 INFO (jsonrpc/5) [vdsm.api] FINISH inappropriateDevices return=None from=::1,36802, task_id=6ab54934-cdd1-48d4-92c6-0262b30aab49 (api:52)
2017-11-10 00:10:50,985-0600 INFO (libvirt/events) [virt.vm] (vmId='32116ed2-da80-4e03-8f39-6435339ba674') Changed state to Down: Admin shut down from the engine (code=6) (vm:1259)
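Reconstructing a timeline like the one above means filtering vdsm.log for one vmId's lifecycle events. A minimal sketch of how that filtering can be scripted (my own helper, not a vdsm tool; the log line layout assumed here matches the excerpts in this report):

```python
import re

# Matches vdsm.log lines that carry a vmId, e.g.:
# 2017-11-10 01:09:28,010-0600 INFO (libvirt/events) [virt.vm] (vmId='...') CPU running: onResume (vm:5093)
# Lines without a vmId (e.g. [vdsm.api] START/FINISH entries) simply won't match.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{4}) "
    r"(?P<level>\w+)\s+\(.*?\) \[(?P<logger>[^\]]+)\] "
    r"\(vmId='(?P<vmid>[0-9a-f-]+)'\) (?P<msg>.*)$"
)

def vm_events(log_lines, vmid):
    """Return (timestamp, message) pairs for one VM, in log order."""
    events = []
    for line in log_lines:
        m = LINE_RE.match(line)
        if m and m.group("vmid") == vmid:
            events.append((m.group("ts"), m.group("msg")))
    return events

sample = [
    "2017-11-10 01:09:28,010-0600 INFO  (libvirt/events) [virt.vm] "
    "(vmId='32116ed2-da80-4e03-8f39-6435339ba674') CPU running: onResume (vm:5093)",
]
# vm_events(sample, '32116ed2-da80-4e03-8f39-6435339ba674')
# -> [('2017-11-10 01:09:28,010-0600', 'CPU running: onResume (vm:5093)')]
```

Sorting the resulting pairs by timestamp makes the ordering problem in the excerpt obvious: the "Killing the VM" entries (00:10) actually precede the "VM started" entries (01:09).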
Anything else we can do here?
No, but thank you for your time.