Bug 1238823 - hosted-engine --vm-status results in a Python exception
Summary: hosted-engine --vm-status results in a Python exception
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: General
Version: ---
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.5.5
Target Release: 1.2.7.2
Assignee: Martin Sivák
QA Contact: Elad
URL:
Whiteboard: sla
Depends On:
Blocks: 1263111
 
Reported: 2015-07-02 17:57 UTC by Matteo Brancaleoni
Modified: 2016-06-23 08:24 UTC (History)
CC: 18 users

Fixed In Version: ovirt-hosted-engine-ha-1.2.7.2-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-26 13:43:49 UTC
oVirt Team: SLA
Embargoed:
ylavi: ovirt-3.5.z?
ylavi: planning_ack+
rule-engine: devel_ack+
rule-engine: testing_ack?




Links
Red Hat Knowledge Base (Solution) 1614463
oVirt gerrit 44686 (master): MERGED - Fill VDSM volumes with zeros before using them for the first time
oVirt gerrit 45305 (ovirt-hosted-engine-ha-1.2): MERGED - Fill VDSM volumes with zeros before using them for the first time

Description Matteo Brancaleoni 2015-07-02 17:57:47 UTC
Description of problem:

A fresh install of oVirt 3.5, made following this guide:
http://community.redhat.com/blog/2014/10/up-and-running-with-ovirt-3-5/
but using iSCSI instead of GlusterFS for the hosted-engine deployment,
on a single node (for now).

CentOS 7 on the host, CentOS 6 on the VM holding the engine.

Everything works: I have some VMs running OK and the engine seems fine, but:

* in the web GUI, Hosted Engine HA is reported as "Not Active"

* from the CLI, "hosted-engine --check-liveliness" returns "Hosted Engine is up!", but "hosted-engine --vm-status" fails with a Python exception:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 116, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 59, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 155, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 102, in get_all_stats
    stats = self._parse_stats(stats, mode)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 141, in _parse_stats
    md = metadata.parse_metadata_to_dict(host_id, data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/metadata.py", line 147, in parse_metadata_to_dict
    constants.METADATA_FEATURE_VERSION))
ovirt_hosted_engine_ha.lib.exceptions.FatalMetadataError: Metadata version 9 from host 5 too new for this agent (highest compatible version: 1)

the "metadata" file, which is really a device under "/rhev/data-center/mnt/blockSD/d4af11cf-c656-40ca-bd42-d81cd7738a6b/ha_agent/hosted-engine.metadata" if observed contains some metadata mixed with a lot of garbage (read: binary data, don't know if expected but according to other discussions there should be only some readable metadata?)

If I restart the agent and the broker, I get notifications reporting the following transitions:

StartState-ReinitializeFSM
ReinitializeFSM-EngineStarting
EngineStarting-EngineUP

The engine is working OK, but the anomalies reported above persist.

Any hint on what to check?

Comment 1 Matteo Brancaleoni 2015-07-03 14:18:48 UTC
CC from the ovirt-users mailing list:

I blindly tried the following:

* checked the name of the block device used for metadata
* shut down the engine VM
* stopped the agent and the broker on the first host
* finally zeroed the block device with:

  dd if=/dev/zero of=/dev/dm-12

  (dm-12 is the block device pointed to by the metadata file;
  a Python equivalent is sketched at the end of this comment)

Started the broker and the agent again; after a while the engine
was started by HA and the metadata was readable.

Now hosted-engine --vm-status works OK and I was able
to add a second node to the cluster.

The web GUI also now reports Hosted Engine HA as Active.

Maybe the metadata block device needs to be cleared when doing an iSCSI setup?

Don't know if this is correct, but it seems to work OK now.
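
For reference, a rough Python equivalent of the dd workaround above (a sketch only, not the shipped fix; double-check the device path before running, since zeroing the wrong device destroys data):

import fcntl
import struct

BLKGETSIZE64 = 0x80081272  # Linux ioctl: block device size in bytes (64-bit)

def zero_device(path, chunk=1024 * 1024):
    # Open read-write so the block device is not truncated on open.
    with open(path, "r+b") as dev:
        size = struct.unpack("L", fcntl.ioctl(dev, BLKGETSIZE64, b"\0" * 8))[0]
        written = 0
        while written < size:
            n = min(chunk, size - written)
            dev.write(b"\0" * n)
            written += n

# Stop the agent and broker and shut down the engine VM first, then:
# zero_device("/dev/dm-12")  # the device behind hosted-engine.metadata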

Comment 2 Doron Fediuck 2015-08-11 11:37:16 UTC
In 3.6.0 there will be an option to fix the metadata in case of such issues.
The root cause here was a mixture of new metadata with an older agent, which
should not happen.

Comment 3 Martin Sivák 2015-08-11 11:46:39 UTC
Actually, the iSCSI volume needs to be wiped before we start using it, as VDSM does not do that automatically.

Comment 4 Sandro Bonazzola 2015-08-19 08:33:25 UTC
The patch has been merged; please move to MODIFIED if no other change is required.

Comment 5 Ilanit Stein 2015-09-02 09:06:46 UTC
Can you please add steps to reproduce?
Is it iSCSI-storage specific? CentOS specific?

Comment 6 Martin Sivák 2015-09-02 10:04:42 UTC
iSCSI specific. The reproducer is quite simple: a standard hosted-engine installation is needed, with a non-clean iSCSI disk used for the storage. Alternatively, take an existing install and fill the ha_agent.metadata with random data.
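
For the second variant, a hedged reproducer sketch (the path is the one from the bug description; adjust it to your setup, stop ovirt-ha-agent and ovirt-ha-broker first, and use it on test environments only):

import os

METADATA = ("/rhev/data-center/mnt/blockSD/"
            "d4af11cf-c656-40ca-bd42-d81cd7738a6b/"
            "ha_agent/hosted-engine.metadata")

def corrupt_metadata(path, nbytes=1024 * 1024):
    # Overwrite the start of the metadata volume with random bytes;
    # the agent should then fail with FatalMetadataError when parsing.
    with open(path, "r+b") as dev:
        dev.write(os.urandom(nbytes))

# corrupt_metadata(METADATA)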

Comment 7 Red Hat Bugzilla Rules Engine 2015-10-18 08:34:17 UTC
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 8 Elad 2015-10-18 11:01:07 UTC
Deployment over a non-clean iSCSI LUN finished successfully with vt17.3. Used the following:
ovirt-hosted-engine-ha-1.2.7.2-1.el7ev.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch.

Comment 9 Sandro Bonazzola 2015-10-26 13:43:49 UTC
oVirt 3.5.5 has been released, including fixes for this issue.

Comment 10 Yedidyah Bar David 2016-06-15 12:28:58 UTC
(In reply to Elad from comment #8)
> Deployment over a non-clean iSCSI LUN finished successfully with vt17.3. Used
> the following:
> ovirt-hosted-engine-ha-1.2.7.2-1.el7ev.noarch
> ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch.

Any chance to find out how this bug was verified?

I am pretty certain that the fix is wrong for iSCSI at least and wonder if it was really verified. See also bug 1346341 and likely also bug 1314522 and other similar reports.

Comment 11 Elad 2016-06-23 08:01:26 UTC
The bug was verified according to the steps in comment 6

Comment 12 Yedidyah Bar David 2016-06-23 08:12:41 UTC
(In reply to Elad from comment #11)
> The bug was verified according to the steps in comment 6

How did you force a non-clean disk?

Comment 13 Elad 2016-06-23 08:22:52 UTC
(In reply to Yedidyah Bar David from comment #12)
> (In reply to Elad from comment #11)
> > The bug was verified according to the steps in comment 6
> 
> How did you force a non-clean disk?

Deployed HE and re-deployed over the same LUN.

Comment 14 Yedidyah Bar David 2016-06-23 08:24:55 UTC
OK. For bug 1346341 we'll provide more detailed instructions. Thanks!

