Bug 1481680

Summary: hosted-engine --upgrade-appliance fails with KeyError: 'stopped' if the metadata area contains references to 3.5 decommissioned hosts
Product: Red Hat Enterprise Virtualization Manager Reporter: Marian Jankular <mjankula>
Component: ovirt-hosted-engine-setup Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA QA Contact: Artyom <alukiano>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.1.5 CC: bburmest, lsurette, pstehlik, ykaul, ylavi
Target Milestone: ovirt-4.2.0 Keywords: Triaged, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Upgrading a hosted engine to 4.0 would fail if references to version 3.5 hosts still existed in the metadata volume of the engine. The user is now warned when this is the case.
Story Points: ---
Clone Of:
: 1486579 (view as bug list) Environment:
Last Closed: 2018-05-15 17:32:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Integration RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1458709, 1486579    

Description Marian Jankular 2017-08-15 12:45:19 UTC
Description of problem:
The hosted engine should warn the admin that there is not enough space when upgrading the appliance.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.1.3.5-1.el7ev.noarch

How reproducible:
n/a

Steps to Reproduce:
1. Deploy RHEV hosted engine 3.6, choosing a size for the hosted-engine VM so that only 40 GB is left free on the hosted-engine storage domain.
2. Run hosted-engine --upgrade-appliance


Actual results:
Setup fails with the message "Hosted Engine upgrade failed".

Expected results:
Setup warns the admin that there is not enough disk space on the hosted-engine storage domain.

Additional info:

2017-08-14 16:18:07 INFO otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:236 The hosted-engine storage domain has enough free space to contain a new backup disk.
2017-08-14 16:18:07 WARNING otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:252 On the hosted-engine disk there is not enough available space to fit the new appliance disk: required 50GiB - available 40GiB. 
.
.
.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-upgradeappliance/core/misc.py", line 303, in _validata_lm_volumes
    stopped = status['all_host_stats'][h]['stopped']
KeyError: 'stopped'

Comment 1 Simone Tiraboschi 2017-08-28 08:29:47 UTC
Let's try to recap:
 2017-08-14 16:17:45 DEBUG otopi.ovirt_hosted_engine_setup.domains domains.check_available_space:116 Available space on /var/tmp is 42784Mb
 2017-08-14 16:17:45 DEBUG otopi.context context.dumpEnvironment:760 ENVIRONMENT DUMP - BEGIN
 2017-08-14 16:17:45 DEBUG otopi.context context.dumpEnvironment:770 ENV OVEHOSTED_STORAGE/ovfSizeGB=int:'50'
 2017-08-14 16:17:45 DEBUG otopi.context context.dumpEnvironment:770 ENV OVEHOSTED_STORAGE/qcowSizeGB=int:'4'

The OVF image is 4 GB and /var/tmp has 42784 MB free, so there is enough space to extract the image from the OVA archive there.

On the hosted-engine storage domain, instead, you have:
 2017-08-14 16:18:07 DEBUG otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:202 Successfully connected to the engine
 2017-08-14 16:18:07 DEBUG otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:211 availalbe: 141733920768

That is 132 GiB, so there is no issue there, and indeed:
 2017-08-14 16:18:07 INFO otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:236 The hosted-engine storage domain has enough free space to contain a new backup disk.
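The conversion of the logged byte count can be checked with a short Python sketch. The helper name bytes_to_gib is hypothetical and is not part of ovirt-hosted-engine-setup; only the byte value comes from the log above.

```python
GIB = 1024 ** 3  # bytes in one GiB


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB (hypothetical helper, for illustration)."""
    return n_bytes / GIB


# Value logged by misc._check_sd_and_disk_space for the storage domain.
available = 141733920768
print(bytes_to_gib(available))  # prints 132.0
```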

The appliance disk is now sized at 50 GB; we can grow a disk on the fly, but we cannot shrink it.

The warning concerned the disk of the 3.6 engine VM, which was 40 GB while the new 4.0 appliance requires 50 GB. Since the setup can grow the disk, this is only a warning:
 2017-08-14 16:18:07 WARNING otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_sd_and_disk_space:252 On the hosted-engine disk there is not enough available space to fit the new appliance disk: required 50GiB - available 40GiB.
 2017-08-14 16:18:07 DEBUG otopi.plugins.otopi.dialog.human human.queryString:145 query UPGRADE_DISK_RESIZE_PROCEED
 2017-08-14 16:18:07 DEBUG otopi.plugins.otopi.dialog.human dialog.__logString:204 DIALOG:SEND                 This upgrade tool can resize the hosted-engine VM disk; before resizing a backup will be created.
 2017-08-14 16:18:07 DEBUG otopi.plugins.otopi.dialog.human dialog.__logString:204 DIALOG:SEND                  Are you sure you want to continue? (Yes, No)[Yes]:

And the user agreed to let the setup grow the VM disk automatically.
No issue so far.
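The grow-but-never-shrink rule described above can be sketched as follows. This is only an illustration under stated assumptions: plan_disk_resize is a hypothetical name and does not reflect the actual setup code.

```python
def plan_disk_resize(current_gib, required_gib):
    """Decide what to do with the engine VM disk (hypothetical helper).

    A disk can be grown on the fly but never shrunk, so a disk that is
    already large enough is simply kept as it is.
    """
    if current_gib < required_gib:
        return "grow"
    return "keep"


# The case from this report: 40 GiB disk, 50 GiB required by the appliance.
print(plan_disk_resize(40, 50))  # prints grow
```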

The issue instead occurs here, when validating the status of the hosted-engine hosts from the metadata area on the shared storage:
2017-08-14 16:19:15 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-upgradeappliance/core/misc.py", line 303, in _validata_lm_volumes
    stopped = status['all_host_stats'][h]['stopped']
KeyError: 'stopped'

And indeed from the logs we can see:

 2: {'engine-status': '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}',
  'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=2515837 (Fri Mar 10 13:46:56 2017)\nhost-id=2\nscore=2400\nmaintenance=False\nstate=EngineDown\n',
  'host-id': 2,
  'host-ts': 2515837,
  'hostname': '****02.******.**',
  'live-data': False,
  'maintenance': False,
  'score': 2400},


whereas 3.6 hosts look something like this:

 9: {'conf_on_shared_storage': False,
  'crc32': 'd96e718b',
  'engine-status': '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}',
  'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=10106154 (Mon Aug 14 16:19:08 2017)\nhost-id=9\nscore=3400\nvm_conf_refresh_time=10106160 (Mon Aug 14 16:19:15 2017)\nconf_on_shared_storage=False\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n',
  'host-id': 9,
  'host-ts': 10106154,
  'hostname': '***09.******.**',
  'live-data': True,
  'local_conf_timestamp': 10106160,
  'maintenance': False,
  'score': 3400,
  'stopped': False},

So the metadata area on the shared storage still held a reference to host 02, and its structure was still in the 3.5 format, which lacks the 'stopped' attribute; hence this issue.
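A minimal sketch of how such entries could be detected before indexing them, assuming a dictionary shaped like the 'all_host_stats' data shown above. The function name hosts_with_old_metadata is hypothetical, and this is not necessarily how the actual fix is implemented.

```python
def hosts_with_old_metadata(all_host_stats):
    """Return the sorted ids of hosts whose metadata lacks 'stopped'.

    Hypothetical helper: it tests for the key instead of indexing it
    directly, so a 3.5-style entry does not raise KeyError.
    """
    return sorted(
        host_id
        for host_id, stats in all_host_stats.items()
        if "stopped" not in stats
    )


# Trimmed-down versions of the two entries from the logs.
all_host_stats = {
    2: {"host-id": 2, "score": 2400, "maintenance": False},
    9: {"host-id": 9, "score": 3400, "maintenance": False, "stopped": False},
}
print(hosts_with_old_metadata(all_host_stats))  # prints [2]
```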

We check the datacenter and cluster levels from the engine, but everything was fine there:
 2017-08-14 16:18:11 DEBUG otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_upgrade_requirements:315 Successfully connected to the engine
 2017-08-14 16:18:11 INFO otopi.plugins.gr_he_upgradeappliance.engine.misc misc._check_upgrade_requirements:344 All the datacenters and clusters are at a compatible level
 
The 3.5 hosts have probably been removed from the engine, but they are still present in the metadata area on the shared storage, hence this issue.

Workaround:
run
 hosted-engine --vm-status
and, one by one, remove all the decommissioned hosts with:
 hosted-engine --clean-metadata --host-id=<id>
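When several decommissioned hosts are involved, the cleanup commands could be generated with a small sketch like the one below. clean_metadata_commands is a hypothetical helper; it only builds the command strings from the workaround and does not execute anything, so each id should still be checked against the output of hosted-engine --vm-status first.

```python
def clean_metadata_commands(host_ids):
    """Build one cleanup command per decommissioned host id.

    Hypothetical helper: returns strings only, nothing is executed.
    """
    return [
        "hosted-engine --clean-metadata --host-id={}".format(host_id)
        for host_id in host_ids
    ]


# Host 2 is the stale 3.5 entry found in this report.
for command in clean_metadata_commands([2]):
    print(command)
```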

Comment 3 Artyom 2017-12-14 10:52:57 UTC
Verified on ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch

[ INFO  ] The hosted-engine storage domain has enough free space to contain a new backup disk.                                                            
[ INFO  ] Checking version requirements
[ INFO  ] Checking metadata area
[ ERROR ] Metadata for host alma05.qa.lab.tlv.redhat.com is incompatible with this tool.
         Before proceeding with this upgrade, please correctly upgrade it to 3.6 or clean its metadata area with
          'hosted-engine --clean-metadata --host-id=2'
         if decommissioned or not anymore involved in HE.
[ ERROR ] Failed to execute stage 'Environment customization': Host with unsupported metadata area
[ INFO  ] Stage: Clean up
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine upgrade failed
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171214101416-syfq5n.log

Comment 6 errata-xmlrpc 2018-05-15 17:32:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1471

Comment 7 Franta Kust 2019-05-16 13:06:43 UTC
BZ<2>Jira Resync