Bug 1479823

Summary:	SHE HA agent could log better when HE VM dies
Product:	[oVirt] ovirt-hosted-engine-ha	Reporter:	Jiri Belka <jbelka>
Component:	Agent	Assignee:	bugs <bugs>
Status:	CLOSED NOTABUG	QA Contact:	meital avital <mavital>
Severity:	low	Docs Contact:
Priority:	unspecified
Version:	2.1.0.6	CC:	bugs
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-09 13:56:32 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	SLA	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jiri Belka 2017-08-09 13:21:23 UTC

Description of problem:

While checking logs why our HE VM went down, I observed HA log does not state clearly that HE VM which was running on the node is down, I would expect nice log message which would just ping me in my nose.

qemu log:

...
2017-08-08 07:01:26.123+0000: shutting down, reason=crashed
...

some events from agent.log:

MainThread::DEBUG::2017-08-08 09:01:34,565::brokerlink::152::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status {"health": "good", "vm": "up", "detail": "up"}
MainThread::DEBUG::2017-08-08 09:02:09,105::brokerlink::152::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
^^ DEBUG

  it seems this is first INFO level (why INFO?) about HE VM down

MainThread::INFO::2017-08-08 09:02:12,969::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False} 
MainThread::INFO::2017-08-08 09:02:12,969::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host slot-2.example.com (id 1): {'con        f_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4560046 (Tue Aug  8 09:01:29 2017)\nhost-id=1\nscore=3000\nvm_conf_refresh_time=4        560070 (Tue Aug  8 09:01:53 2017)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineDown\nstopped=False\n', 'hostname': 'slot-2.rhev.lab.eng.brq.redhat.com', 'alive': True,         'host-id': 1, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 3000, 'stopped': False, 'maintenance': False, '        crc32': '73a15792', 'local_conf_timestamp': 4560070, 'host-ts': 4560046}
MainThread::INFO::2017-08-08 09:02:12,969::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'bad vm st        atus', 'health': 'bad', 'vm': 'down', 'detail': 'down'}, 'bridge': True, 'mem-free': 17691.0, 'maintenance': False, 'cpu-load': 0.1431, 'gateway': True}
...and...
MainThread::INFO::2017-08-08 09:02:12,969::states::402::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm may be running on another host

^^ is this to be first human readable event informing HE VM is down?

MainThread::INFO::2017-08-08 09:02:12,973::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1502175732.97 type=state_transition detail=EngineUp-EngineMaybeAway hostname='slot-5b.example.com'

^^ iiuc this is about mail notification being sent

534271:MainThread::INFO::2017-08-08 09:03:56,476::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1502175836.48 type=state_transition detail=EngineMaybeAway-EngineUnexpectedlyDown hostname='slot-5b.example.com'
...
MainThread::INFO::2017-08-08 09:04:24,977::states::678::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Tue Aug  8 09:04:24 2017
MainThread::INFO::2017-08-08 

^^ iiuc this is host penalization for vm crash

09:04:24,977::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUnexpectedlyDown (score: 0)
MainThread::INFO::2017-08-08 09:04:28,923::states::663::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down, local host does not have best score


I'd love to see some better wording for first discover that HE VM died and maybe better log level that INFO?

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. run SHE and kill ugly HE VM
2. observe agent.log on the host
3.

Actual results:
not really human readable and easy to find out from agent.log when HA agent discovered HE VM is down and starts its activities

Expected results:
nice readable log event with better log level

Additional info:

Comment 2 Jiri Belka 2017-08-09 13:25:57 UTC

See that for starting HE VM on other node log message is quite OK:

MainThread::INFO::2017-08-08 09:04:54,337::states::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3000), attempting to start engine VM

Comment 3 Martin Sivák 2017-08-09 13:56:32 UTC

TLDR: Hosted engine deals with cluster-wide status and so it takes its time before deciding on what happened. Realtime status of VM is monitored by qemu, libvirt and vdsm.

> 09:02:12,969::states::402::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm may be running on another host
>
> ^^ is this to be first human readable event informing HE VM is down?

No, the qemu log already told you (as did libvirt and vdsm). And so did journalctl I hope (as it aggregates the logs).

Hosted engine deals with cluster wide status. It also only checks couple of times per minute and logs the state machine decisions. So here it told you it needs to check whether the VM is running elsewhere before doing anything else.

> 09:04:24,977::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUnexpectedlyDown (score: 0)
> MainThread::INFO::2017-08-08 09:04:28,923::states::663::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down, local host does not have best score

This is the final decision. The transition took 30 seconds due to storage checks, but we just fixed that. It is configured as NOWAIT in the code.




We can't tell if the VM crashed or if the user shut it down using hosted-engine --vm-stop, vds-cli or virsh. We actually do not care here... the job is to make sure the engine is up somewhere.

> I'd love to see some better wording for first discover that HE VM died and 

Aren't qemu/libvirt/vdsm logs enough? We might add a sentence somewhere, but journal already has the same info at least three times.

> maybe better log level that INFO?

INFO is the appropriate level from agent's perspective. WARN / ERROR will be
present in the relevant logs for the VM (vdsm, libvirt).

We could maybe add a WARN level, but we will be outputting warning during almost every migration or manual shutdown, because it will look the same as a crash (until we get access to lifecycle events). So you would be getting:

- WARN engine died!
- oh wait, no, ignore what I just said. It just properly migrated faster than we expected

We do not need so much redundancy in logging, what we need is common structured logging using syslog or journal to be able to correlate information.