Bug 1486579

Summary: [downstream clone - 4.1.6] hosted-engine --upgrade-appliance fails with KeyError: 'stopped' if the metadata area contains references to 3.5 decommissioned hosts
Product: Red Hat Enterprise Virtualization Manager
Reporter: rhev-integ
Component: ovirt-hosted-engine-setup
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA
QA Contact: Artyom <alukiano>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.1.5
CC: apinnick, lsurette, pstehlik, stirabos, ykaul, ylavi
Target Milestone: ovirt-4.1.6
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The Hosted Engine's upgrade to 4.0/RHEL7 failed when there were references to 3.5 hosts in the metadata of the shared storage. Now, a process removes obsolete host references and the upgrade succeeds.
Story Points: ---
Clone Of: 1481680
Environment:
Last Closed: 2017-09-19 07:18:30 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1481680
Bug Blocks: 1484761

Description rhev-integ 2017-08-30 08:04:29 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1481680 +++
======================================================================

Description of problem:
hosted-engine should warn the admin that there is not enough space when upgrading the appliance

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.1.3.5-1.el7ev.noarch

How reproducible:
n/a

Steps to Reproduce:
1. Deploy RHEV hosted engine 3.6 and choose a size for the hosted-engine VM so that only 40 GB is left free on the hosted-engine storage domain
2. Run hosted-engine --upgrade-appliance


Actual results:
Setup fails with the message "Hosted Engine upgrade failed"

Expected results:
Setup warns the admin that there is not enough disk space on the hosted-engine storage domain

Additional info:

2017-08-14 16:18:07 INFO otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:236 The hosted-engine storage domain has enough free space to contain a new backup disk.
2017-08-14 16:18:07 WARNING otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:252 On the hosted-engine disk there is not enough available space to fit the new appliance disk: required 50GiB - available 40GiB. 
.
.
.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-upgradeappliance/core/misc.py", line 303, in _validata_lm_volumes
    stopped = status['all_host_stats'][h]['stopped']
KeyError: 'stopped'

(Originally by Marian Jankular)

Comment 1 rhev-integ 2017-08-30 08:04:34 UTC
Let's try to recap:
 2017-08-14 16:17:45 DEBUG otopi.ovirt_hosted_engine_setup.domains domains.check_available_space:116 Available space on /var/tmp is 42784Mb
 2017-08-14 16:17:45 DEBUG otopi.context context.dumpEnvironment:760 ENVIRONMENT DUMP - BEGIN
 2017-08-14 16:17:45 DEBUG otopi.context context.dumpEnvironment:770 ENV OVEHOSTED_STORAGE/ovfSizeGB=int:'50'
 2017-08-14 16:17:45 DEBUG otopi.context context.dumpEnvironment:770 ENV OVEHOSTED_STORAGE/qcowSizeGB=int:'4'

The qcow image is 4 GB and /var/tmp has 42784 MB free, so there is enough space to extract the image from the OVA archive there.

On the hosted-engine SD, instead, you have:
 2017-08-14 16:18:07 DEBUG otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:202 Successfully connected to the engine
 2017-08-14 16:18:07 DEBUG otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:211 availalbe: 141733920768

That is 132 GiB, so there is no issue there, and indeed:
 2017-08-14 16:18:07 INFO otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:236 The hosted-engine storage domain has enough free space to contain a new backup disk.
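
The byte count in the debug line above can be converted by hand or with a few lines of Python. This is only an arithmetic check, not part of the setup code; the helper name `bytes_to_gib` is made up for the illustration.

```python
# Value taken from the "availalbe:" debug line above
# (free bytes on the hosted-engine storage domain).
AVAILABLE_BYTES = 141733920768


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return n_bytes / float(2 ** 30)


available_gib = bytes_to_gib(AVAILABLE_BYTES)
print("%.1f GiB" % available_gib)
```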

The appliance disk is now sized at 50 GB; we can grow an existing disk on the fly, but we cannot shrink it.

The warning was about the disk of the 3.6 engine VM, which was 40 GB while the new 4.0 appliance requires 50 GB. Setup can grow the disk, so this is only a warning:
 2017-08-14 16:18:07 WARNING otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:252 On the hosted-engine disk there is not enough available space to fit the new appliance disk: required 50GiB - available 40GiB.
 2017-08-14 16:18:07 DEBUG otopi.plugins.otopi.dialog.human human.queryString:145 query UPGRADE_DISK_RESIZE_PROCEED
 2017-08-14 16:18:07 DEBUG otopi.plugins.otopi.dialog.human dialog.__logString:204 DIALOG:SEND                 This upgrade tool can resize the hosted-engine VM disk; before resizing a backup will be created.
 2017-08-14 16:18:07 DEBUG otopi.plugins.otopi.dialog.human dialog.__logString:204 DIALOG:SEND                  Are you sure you want to continue? (Yes, No)[Yes]:

The user agreed to let setup grow the VM disk automatically.
No issue so far.

The issue is instead here, where setup validates the status of the hosted-engine hosts from the metadata area on the shared storage:
2017-08-14 16:19:15 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-upgradeappliance/core/misc.py", line 303, in _validata_lm_volumes
    stopped = status['all_host_stats'][h]['stopped']
KeyError: 'stopped'

And indeed, the logs show this entry for host 2:

 2: {'engine-status': '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}',
  'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=2515837 (Fri Mar 10 13:46:56 2017)\nhost-id=2\nscore=2400\nmaintenance=False\nstate=EngineDown\n',
  'host-id': 2,
  'host-ts': 2515837,
  'hostname': '****02.******.**',
  'live-data': False,
  'maintenance': False,
  'score': 2400},


while a 3.6 host has something like this instead:

 9: {'conf_on_shared_storage': False,
  'crc32': 'd96e718b',
  'engine-status': '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}',
  'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=10106154 (Mon Aug 14 16:19:08 2017)\nhost-id=9\nscore=3400\nvm_conf_refresh_time=10106160 (Mon Aug 14 16:19:15 2017)\nconf_on_shared_storage=False\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n',
  'host-id': 9,
  'host-ts': 10106154,
  'hostname': '***09.******.**',
  'live-data': True,
  'local_conf_timestamp': 10106160,
  'maintenance': False,
  'score': 3400,
  'stopped': False},

So the metadata area on the shared storage still held a reference to host 02, but its structure was still in the 3.5 format, which lacks the 'stopped' attribute; hence the KeyError.
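
The failure can be reproduced in isolation. The sketch below is not the code that shipped in the fix; it uses trimmed copies of the two entries quoted above, and the helper name `hosts_missing_stopped` is hypothetical.

```python
# Trimmed copies of the two metadata entries quoted above.
all_host_stats = {
    2: {'host-id': 2, 'live-data': False, 'maintenance': False,
        'score': 2400},
    9: {'host-id': 9, 'live-data': True, 'maintenance': False,
        'score': 3400, 'stopped': False},
}


def hosts_missing_stopped(stats):
    """Return sorted host ids whose entry lacks 'stopped' (hypothetical helper)."""
    return sorted(h for h, entry in stats.items() if 'stopped' not in entry)


# Direct indexing, as in the traceback, fails on the 3.5-shaped entry.
try:
    all_host_stats[2]['stopped']
    raised = False
except KeyError:
    raised = True

legacy_hosts = hosts_missing_stopped(all_host_stats)
print(raised, legacy_hosts)
```

Checking for the key up front lets the tool name the offending host ids instead of stopping with a bare traceback.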

We check the datacenter and cluster levels from the engine, and everything was fine there:
 2017-08-14 16:18:11 DEBUG otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_upgrade_requirements:315 Successfully connected to the engine
 2017-08-14 16:18:11 INFO otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_upgrade_requirements:344 All the datacenters and clusters are at a compatible level
 
The 3.5 hosts have probably been removed from the engine, but they are still present in the metadata area on the shared storage, which causes this issue.

Workaround:
Run
 hosted-engine --vm-status
and, one by one, remove all the decommissioned hosts with:
 hosted-engine --clean-metadata --host-id=<id>
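
When several stale entries exist, the per-host commands can be generated rather than typed. This minimal sketch only builds the command lines and runs nothing; `clean_metadata_commands` is a hypothetical helper, and the host ids have to be read from the `hosted-engine --vm-status` output first.

```python
def clean_metadata_commands(host_ids):
    """Build one 'hosted-engine --clean-metadata' argv list per host id.

    Hypothetical helper: it only assembles the command shown in the
    workaround above and does not execute it.
    """
    return [
        ['hosted-engine', '--clean-metadata', '--host-id=%d' % int(h)]
        for h in host_ids
    ]


# Example: host 2 is the decommissioned 3.5 host from the log above.
for argv in clean_metadata_commands([2]):
    print(' '.join(argv))
```

Each list could be passed to `subprocess.check_call` on a hosted-engine host once the ids have been confirmed as decommissioned.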

(Originally by Simone Tiraboschi)

Comment 3 Sandro Bonazzola 2017-09-04 06:46:48 UTC
Simone, can you please update doc-text?

Comment 5 Artyom 2017-09-07 11:57:48 UTC
Verified on ovirt-hosted-engine-setup-2.1.3.8-1.el7ev.noarch

1) Create an environment that has 3.5 hosted-engine metadata for a host that was deleted
2) Run # hosted-engine --upgrade-appliance
[ ERROR ] Metadata for host cyan-vdsf.qa.lab.tlv.redhat.com is incompatible with this tool. Before proceeding with this upgrade, please correctly upgrade it to 3.6 or clean its metadata area with  'hosted-engine --clean-metadata --host-id=2' if decommissioned or not anymore involved in HE.
[ ERROR ] Failed to execute stage 'Environment customization': Host with unsupported metadata area
[ INFO  ] Stage: Clean up
3) Stop ovirt-ha-agent and clean the metadata of the nonexistent host
# hosted-engine --clean-metadata --force-clean --host-id=2
4) Run # hosted-engine --upgrade-appliance again - PASS

Comment 8 errata-xmlrpc 2017-09-19 07:18:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2748