Description of problem:

hosted-engine --upgrade-appliance fails with a strange error, see below. It is a long-running 3.6 prod env of the brq rhev qe team (aka brq-setup).

~~~
...
[ INFO  ] The engine VM is running on this host
[ INFO  ] Stage: Environment customization
[ INFO  ] Answer file successfully loaded
[ ERROR ] Failed to execute stage 'Environment customization': File contains no section headers.
file: <???>, line: 1
u'None'
[ INFO  ] Stage: Clean up
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine upgrade failed
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20160814035947-y3efzj.log
You have new mail in /var/spool/mail/root
~~~

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.0.1.4-1.el7ev.noarch

How reproducible:
Hit our SHE setup.

Steps to Reproduce:
1. SHE 3.6 env - a long-existing SHE env
2. Stop VMs, put all non-HE hosts into maintenance, set global maintenance
3. On the HE host where the HE VM is running, with 4.0 rpms, run: hosted-engine --upgrade-appliance

Actual results:
Migration fails; hosted-engine can't read the answer file.

Expected results:
Should work; a clean 3.6 -> 4.0 has worked fine.

Additional info:
(In reply to Jiri Belka from comment #0)
> Description of problem:
>
> hosted-engine --upgrade-appliance fails with strange error, see below. it is
> long running 3.6 prod env of brq rhev qe team (aka brq-setup).

setup log has:

2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile remote_answerfile._fetch_answer_file:69 fetching from: /rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/images/0e961108-e830-440e-a613-7739f5852a0a/c1228161-4822-47f9-945e-3273d9fd445a
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:69 executing: 'sudo -u vdsm dd if=/rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/images/0e961108-e830-440e-a613-7739f5852a0a/c1228161-4822-47f9-945e-3273d9fd445a bs=4k'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:70 executing: 'tar -tvf -'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:88 stdout:
-rw-r--r-- 0/0 7 1970-01-01 01:00 version
-rw-r--r-- 0/0 4 1970-01-01 01:00 fhanswers.conf
-rw-r--r-- 0/0 1046 1970-01-01 01:00 hosted-engine.conf
-rw-r--r-- 0/0 182 1970-01-01 01:00 broker.conf
-rw-r--r-- 0/0 1317 1970-01-01 01:00 vm.conf
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:89 stderr:
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib.extractConfFile:138 extracting 'fhanswers.conf' from '/rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/images/0e961108-e830-440e-a613-7739f5852a0a/c1228161-4822-47f9-945e-3273d9fd445a'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:69 executing: 'sudo -u vdsm dd if=/rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/images/0e961108-e830-440e-a613-7739f5852a0a/c1228161-4822-47f9-945e-3273d9fd445a bs=4k'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:70 executing: 'tar -xOf - fhanswers.conf'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:88 stdout:
None
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:89 stderr:
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile remote_answerfile._fetch_answer_file:82 Answer file form the shared storage: None
2016-08-14 03:59:52 INFO otopi.plugins.gr_he_common.core.remote_answerfile remote_answerfile._fetch_answer_file:85 Answer file successfully loaded
2016-08-14 03:59:52 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-common/core/remote_answerfile.py", line 177, in _customization
    self._parse_answer_file()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-common/core/remote_answerfile.py", line 89, in _parse_answer_file
    self._config.readfp(buf)
  File "/usr/lib64/python2.7/ConfigParser.py", line 324, in readfp
    self._read(fp, filename)
  File "/usr/lib64/python2.7/ConfigParser.py", line 512, in _read
    raise MissingSectionHeaderError(fpname, lineno, line)
MissingSectionHeaderError: File contains no section headers.
file: <???>, line: 1
u'None'
2016-08-14 03:59:52 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Environment customization': File contains no section headers.

fhanswers.conf is 4 bytes long, most likely corrupt/empty/etc.
Please attach setup/HA/vdsm logs and '/etc/ovirt-hosted-engine/answers.conf' from all HE hosts from around the time it was upgraded from 3.5 to 3.6. Thanks.
(These 4 bytes are 'None')
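For reference, the parser failure in the traceback above can be reproduced in isolation. This is a minimal sketch using Python 3's configparser (the setup code ran Python 2's ConfigParser, but the behavior is the same): a file containing only the literal string 'None' has no '[section]' header, so parsing fails at line 1.

```python
import configparser
import io

# The corrupt fhanswers.conf extracted from the shared storage contains
# only the 4-byte string 'None' -- no '[section]' header at all.
corrupt_answer_file = u"None"

parser = configparser.ConfigParser()
error = None
try:
    parser.read_file(io.StringIO(corrupt_answer_file))
except configparser.MissingSectionHeaderError as e:
    # Same failure as in the setup log: "File contains no section
    # headers ... line: 1 u'None'"
    error = e
```

The exception reports line 1 with the content 'None', matching the setup log exactly.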
I strongly suspect that adding additional hosts to that setup will also be problematic. The question is how, in the past, we lost the answer file while creating the configuration volume.
(In reply to Yedidyah Bar David from comment #2)
> [...]
>
> Please attach setup/HA/vdsm logs and '/etc/ovirt-hosted-engine/answers.conf'
> from all HE hosts from around the time it was upgraded from 3.5 to 3.6.

We do not have such a backup; we only have a backup done via engine-backup from that time, which does not contain logs and files from the hosts, as it is engine-oriented. I also do not see any backup files in /etc/ovirt-hosted-engine on the hosts.

You can ping me on IRC if any other info is needed.
To clarify for ordinary people :)

[root@slot-2 ~]# file=$(awk -F= '/^conf_volume_UUID/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf)
[root@slot-2 ~]# find /rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup -type f -name "$file" | xargs tar tvf
-rw-r--r-- 0/0 7 1970-01-01 01:00 version
-rw-r--r-- 0/0 4 1970-01-01 01:00 fhanswers.conf
-rw-r--r-- 0/0 1046 1970-01-01 01:00 hosted-engine.conf
-rw-r--r-- 0/0 182 1970-01-01 01:00 broker.conf
-rw-r--r-- 0/0 1317 1970-01-01 01:00 vm.conf

vs a HE host where the answer file exists:

[root@10-34-60-215 ~]# file=$(awk -F= '/^conf_volume_UUID/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf)
[root@10-34-60-215 ~]# find /rhev/data-center/mnt/10.34.63.199:_jbelka_test -type f -name "$file" | xargs tar -tvf
-rw-r--r-- 0/0 7 1970-01-01 01:00 version
-rw-r--r-- 0/0 2454 1970-01-01 01:00 fhanswers.conf
-rw-r--r-- 0/0 739 1970-01-01 01:00 hosted-engine.conf
-rw-r--r-- 0/0 182 1970-01-01 01:00 broker.conf
-rw-r--r-- 0/0 1316 1970-01-01 01:00 vm.conf
I tried to reproduce 3.5 -> 3.6 SHE migration and fhanswers.conf on storage domain is equal to /etc/ovirt-hosted-engine/answers.conf, thus it has data.
(In reply to Jiri Belka from comment #8)
> I tried to reproduce 3.5 -> 3.6 SHE migration and fhanswers.conf on storage
> domain is equal to /etc/ovirt-hosted-engine/answers.conf, thus it has data.

Please try this to reproduce:
1. deploy 3.5
2. rm /etc/ovirt-hosted-engine/answers.conf
3. yum update to 3.6

Then check fhanswers.conf on the shared storage after the upgrade finishes. You should see this in agent.log:

Upgrading to current version
Saving hosted-engine configuration on the shared storage domain
Configuration file '{path}' not available: {ex}
Successfully moved the configuration to the shared storage

You are then welcome to open a bug on -ha. Not sure exactly what you should write in 'Expected Results'; probably that it should fail instead of writing None.

You can try enforcing another upgrade by removing the shared conf volume. This will revert any changes done to the engine VM conf to what's saved in /etc.
(In reply to Yedidyah Bar David from comment #9)
> Please try this to reproduce:
> 1. deploy 3.5
> 2. rm /etc/ovirt-hosted-engine/answers.conf
> 3. yum update to 3.6
>
> Then check fhanswers.conf on the shared storage after upgrade finishes. You
> should see this in agent.log:
>
> Upgrading to current version
> Saving hosted-engine configuration on the shared storage domain
> Configuration file '{path}' not available: {ex}
> Successfully moved the configuration to the shared storage

It worked fine during my 3.4 -> 3.5 -> 3.6 migration. I see the above lines, except "Configuration file '{path}' not available: {ex}", in agent.log.

I see that fhanswers.conf is equal to /etc/ovirt-hosted-engine/answers.conf on the host where these lines appear in agent.log.

On the other host the file differs.

> You are then welcome to open a bug on -ha. Not sure exactly what you should
> write in 'Expected Results', probably that it should fail instead of writing
> None.
>
> You can try enforcing another upgrade by removing the shared conf volume.
> This will revert any changes done to engine vm conf, to what's saved in /etc.

Please be more specific about how we can "repair" our setup; we need advice on how to work around this current issue.
(In reply to Jiri Belka from comment #10)
> [...]
> It worked fine during my 3.4 -> 3.5 -> 3.6 migration. I see above lines
> except "Configuration file '{path}' not available: {ex}" in agent.log.

Because you didn't try (2.) above.

> Please be more specific how we can "repair" our setup, we need an advice to
> workaround this current issue.

I'd not start trying to repair your setup before we reproduce on a test system to test the workaround.
(In reply to Yedidyah Bar David from comment #11)
> [...]
> I'd not start trying to repair your setup before we reproduce on a test
> system to test the workaround.
# file=$( awk -F= '/^conf_volume/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf )
# domain=$( awk -F= '/^sdUUID/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf )
# find /rhev/data-center/ -path "*/$domain/*" -type f -name "$file" | xargs -I {} tar Oxf {} version
1.3.5.7[root@dell-r210ii-03 ~]#
# find /rhev/data-center/ -path "*/$domain/*" -type f -name "$file" | xargs -I {} tar Oxf {} fhanswers.conf
None#
# egrep "(Upgrading to current|Saving hosted-engine|Configuration file|Successfully moved)" /var/log/ovirt-hosted-engine-ha/agent.log
MainThread::INFO::2016-08-16 15:15:43,539::upgrade::997::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-08-16 15:16:28,889::upgrade::997::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-08-16 15:16:45,458::upgrade::408::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_conf_tar) Saving hosted-engine configuration on the shared storage domain
MainThread::ERROR::2016-08-16 15:16:45,459::upgrade::396::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Configuration file '/etc/ovirt-hosted-engine/answers.conf' not available: [Errno 2] No such file or directory: '/etc/ovirt-hosted-engine/answers.conf'
MainThread::INFO::2016-08-16 15:16:45,526::upgrade::975::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) Successfully moved the configuration to the shared storage
# grep ovirt-hosted /var/log/yum.log
Aug 16 11:46:57 Installed: ovirt-hosted-engine-ha-1.2.10-1.el7ev.noarch
Aug 16 11:46:58 Installed: ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
Aug 16 15:14:39 Updated: ovirt-hosted-engine-ha-1.3.5.7-1.el7ev.noarch
Aug 16 15:14:40 Updated: ovirt-hosted-engine-setup-1.3.7.2-1.el7ev.noarch
Great. Please attach from host: /etc/ovirt-hosted* /var/log/ovirt-hosted* . Thanks.
As a consequence of the above proposed steps, this is what happened on a non-SPM host with rpms updated to 4.0:

# hosted-engine --upgrade-appliance
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli
[ INFO  ] Stage: Initializing
[ ERROR ] Failed to execute stage 'Initializing': 'Configuration value not found: file=/etc/ovirt-hosted-engine/hosted-engine.conf, key=conf_image_UUID'
[ INFO  ] Stage: Clean up
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20160817105124-um2nww.log

conf_image_UUID really does not exist in /etc/ovirt-hosted-engine/hosted-engine.conf on this host, but it does exist in the file on the other host, which is still 3.6 and which was the one used for the migration from 3.5 -> 3.6.
So I updated rpms on the other host, the one used for the migration from 3.5 -> 3.6 and where I deleted /etc/ovirt-hosted-engine/answers.conf as requested, and it ended in the same issue:

...
[ INFO  ] Answer file successfully loaded
[ ERROR ] Failed to execute stage 'Environment customization': File contains no section headers.
file: <???>, line: 1
u'None'
[ INFO  ] Stage: Clean up
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine upgrade failed
...
So we have a reproduction.

I think the following might be a workaround. This will remove the configuration volume from the shared storage and force HA to re-create it.

1. Choose one host to do this on. Make sure its local configuration (in /etc/ovirt-hosted-engine) is up-to-date. You can compare with what's on the shared storage and on other hosts.

2. Move all hosts to local maintenance and stop HA services on all of them.

3. Delete the configuration volume. It should be something like:
/rhev/data-center/mnt/*/$sdUUID/images/$conf_image_UUID/$conf_volume_UUID
where sdUUID, conf_image_UUID and conf_volume_UUID are taken from /etc/ovirt-hosted-engine/hosted-engine.conf.

4. Edit /etc/ovirt-hosted-engine/hosted-engine.conf as follows:

Remove lines starting with:
conf_volume_UUID=
conf_image_UUID=
vm_disk_vol_id=

Edit the line:
spUUID=00000000-0000-0000-0000-000000000000
to be:
spUUID=$POOL_UUID
where POOL_UUID is from /rhev/data-center/mnt/didi-lap:_he1/$sdUUID/dom_md/metadata

5. Start HA services on this host.

6. Monitor agent.log to see the upgrade flow as in comment 12, hopefully this time without an error.

7. Compare /etc/ovirt-hosted-engine/hosted-engine.conf with what you had before. The new file should have new IDs for conf_*_UUID and spUUID=00000000-0000-0000-0000-000000000000.

8. Start HA on all other hosts and move all hosts out of local maintenance.

When finished, please attach /etc/ovirt-hosted* /var/log/ovirt-hosted* from all hosts.

Setting needinfo also on Simone to review. Thanks.
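The conf-file editing in step 4 can be sketched as a small helper. This is illustrative only — the function name and the in-memory approach are mine, not part of any oVirt tooling; back up hosted-engine.conf and stop the HA services as in step 2 before touching the real file.

```python
def rewrite_he_conf(text, pool_uuid):
    """Return hosted-engine.conf content with the conf volume IDs
    removed (so HA re-creates the configuration volume) and the
    bootstrap spUUID replaced by the real pool UUID (step 4 above)."""
    drop_prefixes = ("conf_volume_UUID=", "conf_image_UUID=", "vm_disk_vol_id=")
    out = []
    for line in text.splitlines():
        if line.startswith(drop_prefixes):
            continue  # drop the lines step 4 says to remove
        if line.startswith("spUUID="):
            line = "spUUID=" + pool_uuid  # replace the bootstrap zero-UUID
        out.append(line)
    return "\n".join(out) + "\n"
```

Fed with the content of /etc/ovirt-hosted-engine/hosted-engine.conf and the POOL_UUID value read from the domain metadata, the result would then be written back in place.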
I created a BZ about checking whether the content of the conf_volume tarball is valid: https://bugzilla.redhat.com/show_bug.cgi?id=1367732
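A check along the lines of that bug could look roughly like this. This is a hedged sketch, not the actual ovirt-hosted-engine-ha code; `conf_tar_is_valid` is a name of my own invention. It refuses a conf tarball whose fhanswers.conf member is missing or not a parseable config file (such as the 4-byte 'None' seen in this bug):

```python
import configparser
import io
import tarfile

def conf_tar_is_valid(tar_bytes):
    """Return True only if the conf-volume tarball contains an
    fhanswers.conf that a config parser can actually read."""
    try:
        with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
            member = tar.extractfile("fhanswers.conf")
            if member is None:
                return False
            content = member.read().decode("utf-8")
        parser = configparser.ConfigParser()
        parser.read_string(content)  # raises on 'None' / garbage
        return True
    except (tarfile.TarError, KeyError, configparser.Error, UnicodeDecodeError):
        return False
```

With a check like this, the upgrade flow could refuse to proceed instead of silently carrying a dumb answer file forward.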
The proposed workaround seems correct and complete, but we have to properly test it.

As far as I understood, this issue can only happen if the initial answer file from setup time was removed from the host before the 3.5 -> 3.6 upgrade. In that case the upgrade procedure should fail, reporting the issue, while it's probably just writing an empty/dumb answer file to the shared storage, which leads to future issues such as this bug.
(In reply to Yedidyah Bar David from comment #18)
> 4. Edit /etc/ovirt-hosted-engine/hosted-engine.conf as follows:
> Remove lines starting with:
> conf_volume_UUID=
> conf_image_UUID=
> vm_disk_vol_id=
>
> Edit the line:
> spUUID=00000000-0000-0000-0000-000000000000
> to be:
> spUUID=$POOL_UUID
> where POOL_UUID is from
> /rhev/data-center/mnt/didi-lap:_he1/$sdUUID/dom_md/metadata

I do not see a POOL_UUID value in the path which contains $sdUUID. Any idea?

# grep -E '^(sdUUID|conf_)' /etc/ovirt-hosted-engine/hosted-engine.conf.orig
sdUUID=990f8f44-a511-4a46-9f8c-468ca9eda05d
conf_volume_UUID=bd8c5cb7-4898-4108-a3f1-773c7c9f4cf5
conf_image_UUID=de2f46cf-5143-4db8-9122-078d2e4cac0d

# find /rhev/data-center/ -path "*/$domain/*" -type f -name 'metadata' | xargs cat
CLASS=Data
DESCRIPTION=hosted_storage
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=ON
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=hosted_datacenter
POOL_DOMAINS=990f8f44-a511-4a46-9f8c-468ca9eda05d:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=
REMOTE_PATH=10.34.63.199:/jbelka/jb-she_test
ROLE=Regular
SDUUID=990f8f44-a511-4a46-9f8c-468ca9eda05d
TYPE=NFS
VERSION=3
_SHA_CKSUM=55080f2b6960048c2d653b846ffad999f4feb825

# find /rhev/data-center/ -type f -name 'metadata' | xargs grep '^POOL_UUID'
/rhev/data-center/mnt/10.34.63.199:_jbelka_jb-she__test/990f8f44-a511-4a46-9f8c-468ca9eda05d/dom_md/metadata:POOL_UUID=
/rhev/data-center/mnt/_var_lib_ovirt-hosted-engine-ha_tmpRqSNVH/8a1780bf-7ae0-46a6-918e-c4f06b44b2b0/dom_md/metadata:POOL_UUID=4ed71ad1-4ad6-4278-be92-a26e67f98f22
/rhev/data-center/mnt/10.34.63.199:_jbelka_jb-she__test-data/d12f4e3a-4637-422c-95f1-5816bdf01f22/dom_md/metadata:POOL_UUID=00000002-0002-0002-0002-0000000001dd
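For reference, the dom_md/metadata file queried above is a flat KEY=VALUE listing, so extracting POOL_UUID programmatically is trivial — the helper name below is illustrative, and note the important distinction between the key being absent and the value being empty (an empty value means the domain is not attached to any pool, as in the first find hit above):

```python
def read_pool_uuid(metadata_text):
    """Return the POOL_UUID value from a storage-domain dom_md/metadata
    file, or None if the key is absent. An empty string is a legitimate
    value for a detached domain."""
    for line in metadata_text.splitlines():
        if line.startswith("POOL_UUID="):
            return line.split("=", 1)[1]
    return None
```
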
One thing I know for sure: do not leave spUUID=00000000-0000-0000-0000-000000000000, because that's the flag for the upgrade process to know whether it was done.

Other than that, not sure - perhaps just remove the line, or put some random UUID. Keeping needinfo; perhaps Simone knows better. I think it will simply fail without doing harm if you do something "wrong", so you can simply try various things.
The upgrade procedure also tries to detach the hosted-engine storage domain from its bootstrap storage pool to let the engine import it, so the fixing procedure reported in comment 18 will fail, since our storage domain was already detached. Reattaching the hosted-engine storage domain to its bootstrap storage pool is far too complex and risky since the engine already imported it.

An easier procedure is to simply re-write the existing configuration volume, including the correct answer file.

This script will do the job:

#!/bin/sh
dir=`mktemp -d` && cd $dir
sdUUID=$(awk -F= '/^sdUUID/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf)
conf_volume_UUID=$(awk -F= '/^conf_volume_UUID/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf)
conf_image_UUID=$(awk -F= '/^conf_image_UUID/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf)
systemctl stop ovirt-ha-broker  # on all hosts!
find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xvf - 2>/dev/null' {} {} \;
cat /etc/ovirt-hosted-engine/answers.conf > fhanswers.conf  # the source file should be sane
find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'tar -cf- * | sudo -u vdsm dd of=$1 2>/dev/null' {} {} \;
systemctl restart ovirt-ha-agent  # on all hosts!
(In reply to Simone Tiraboschi from comment #23)
> [...]
> This script will do the job:
> [...]

This workaround enabled me to migrate to 4.0 successfully.
Why did you clone to upstream? What is the diff between the bugs?
1367732 is on ovirt-hosted-engine-ha: refusing to upgrade (3.5 -> 3.6) if the answer file is missing on the host, instead of writing a dumb answer file on the shared storage.

1366879 is on ovirt-hosted-engine-setup: providing a clear error message if the answer file on the shared storage is not consumable.
(In reply to Simone Tiraboschi from comment #26)
> 1367732 is on ovirt-hosted-engine-ha: refusing to upgrade (3.5->3.6) if the
> answerfile is missing on the host instead of writing a dumb answerfile on
> the shared storage
>
> 1366879 is on ovirt-hosted-engine-setup: providing a clear error message if
> the answerfile on the shared storage is not consumable

Ok, please make sure this goes into 3.6.9 as well.
(In reply to Yaniv Dary from comment #27) > Ok, please make sure this goes in to 3.6.9 as well. Cloned
This bug is a side effect of rhbz#1367732.

These two bugs can only occur if the system was initially deployed on 3.5 or before: at that point the answer file for hosted-engine setup was at /etc/ovirt-hosted-engine/answers.conf.

The root cause for these two bugs, and the only way we found to reproduce them, is that the user manually deleted the answer file on the host before upgrading to 3.6. The upgrade procedure from 3.5 -> 3.6 should copy it to the shared storage. Due to bug 1367732, the 3.5 -> 3.6 upgrade didn't stop if the answer file was missing and simply wrote an answer file with just 'None' inside to the shared storage, and this could lead to future issues. The patch for bug 1367732 will prevent this: the upgrade will not be performed until the user restores the answer file on his host.

Future issues caused by the dumb answer file are tracked in this bug: the upgrade of the engine appliance to 4.0 (or also adding a new hosted-engine host from the CLI) will fail if ovirt-hosted-engine-setup fails to parse the answer file on the shared storage. With the proposed patch, hosted-engine-setup simply emits a clearer error message if the answer file on the shared storage is not valid. It's not going to auto-recover the missing answer file, since we don't have, and cannot guess, all the required info. The most reasonable recovery action is to manually recover the lost answer file (from a backup or from another host) and copy it to the shared storage. We proposed a recovery script for that here: https://bugzilla.redhat.com/show_bug.cgi?id=1366879#c23
(In reply to Simone Tiraboschi from comment #29) > The root cause for this two bug, and the only way we found to > reproduce it, is that the user has manually deleted the answer file on > the host before upgrading to 3.6. Another similar flow which we didn't see IIUC but seems possible is that the host that was used for upgrade didn't have the file originally. This can happen if the initial deploy failed at some later stage but before writing this file.
Works for me on these components on host:
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.21.x86_64
sanlock-3.2.4-3.el7_2.x86_64
rhevm-appliance-20160731.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.5-1.el7ev.noarch
mom-0.5.5-1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64
rhev-release-3.6.9-1-001.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
Linux version 3.10.0-327.36.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Wed Aug 17 03:02:37 EDT 2016
Linux 3.10.0-327.36.1.el7.x86_64 #1 SMP Wed Aug 17 03:02:37 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
ovirt-engine-dwh-setup-4.0.2-1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-setup-0.3.0-0.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.3-0.1.el7ev.noarch
ovirt-engine-restapi-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
ovirt-engine-cli-3.6.8.1-1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-0.3.0-0.el7ev.noarch
ovirt-engine-tools-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.0.3-0.1.el7ev.noarch
python-ovirt-engine-sdk4-4.0.0-0.5.a5.el7ev.x86_64
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-engine-dashboard-1.0.3-1.el7ev.x86_64
ovirt-engine-userportal-4.0.3-0.1.el7ev.noarch
ovirt-engine-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.1-1.el7ev.noarch
ovirt-engine-lib-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-4.0.3-0.1.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.4-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.3-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.2-1.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.3-0.1.el7ev.noarch
ovirt-engine-backend-4.0.3-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhevm-doc-4.0.0-3.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhev-guest-tools-iso-4.0-5.el7ev.noarch
rhevm-4.0.3-0.1.el7ev.noarch
rhevm-branding-rhev-4.0.0-5.el7ev.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.2-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

During hosted-engine --upgrade-appliance I've used the rhevm-appliance-20160731.0-1.el7ev.noarch, then updated the engine's repos and installed the latest 4.0.3 bits.
AFAIU, the current bug is about a bad answer file in shared storage, so verification should have started from a 3.6 system in that state, and the result should have been a nicer error message. No?
Created attachment 1195906 [details] Picture of extending the hosted storage via WEBUI
Please add these missing steps to the documentation:
1) The customer might have to extend the hosted-storage, as shown in the attachment, prior to the upgrade.
2) Add that the host's components should be 4.0.3 and not 3.6.9; otherwise the "--upgrade-appliance" functionality won't be available on the host.
(In reply to Yedidyah Bar David from comment #37)
> AFAIU, current bug is about bad answer file in shared storage, and
> verification should have started from a 3.6 system in that state, and result
> should have been a nicer error message. No?

The environment really started from 3.6.9, then it was upgraded to 4.0.3.
(In reply to Nikolai Sednev from comment #39)
> Please add these missing steps to documentation:
> 1)Customer might have to extend the hosted-storage as appears within the
> attachment, prior to upgrade.
> 2)Add that host's components should be 4.0.3 and not 3.6.9, otherwise
> "--upgrade-appliance" functionality won't be available on host.

Hi Nikolai,

Thanks for letting me know about the documentation requirements. To clarify those two points:

1) Does this mean that a customer may need to add additional space to their shared self-hosted engine storage? Do we have any idea how much extra space is required? It would be clearer to give a minimum storage value, so a customer could check whether more space is required before they begin the upgrade.

2) Does this mean that 'rhel-7-server-rhv-4.0-rpms' and 'rhel-7-server-rhv-4-mgmt-agent-rpms' must be enabled on the host, and the 'ovirt-hosted-engine-setup' package updated, before the user can run 'hosted-engine --upgrade-appliance'? I assume this would apply only to updating RHEL hosts, because new RHVH hosts would have the required package versions already.

Our current documentation is available here:
https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/single/self-hosted-engine-guide#Upgrading_the_Self-Hosted_Engine

It may be easier to point to certain places in the current documentation where you think a change is required.
(In reply to Lucy Bopf from comment #41)
> 1) Does this mean that a customer may need to add additional space to their
> shared self-hosted engine storage? Do we have any idea how much extra space
> is required?
>
> 2) Does this mean that 'rhel-7-server-rhv-4.0-rpms' and
> 'rhel-7-server-rhv-4-mgmt-agent-rpms' must be enabled on the host, and the
> 'ovirt-hosted-engine-setup' package updated before the user can run
> 'hosted-engine --upgrade-appliance'?

1) Yes. If the hosted-storage iSCSI LUN runs out of space, the user can expand it on their storage appliance (I used an XIO storage appliance), and then expand the hosted-storage domain via the engine web UI, as shown in the attachment. In my case the initial storage was 75G; after 3.6.9 had been running, only 20G was left, while the failed upgrade reported on screen that a minimum of 50G more was required. I expanded the LUN to 150G just in case, which then appeared as an additional +75G of storage in the attachment; I clicked that "button", then "Ok", and re-ran the upgrade with "hosted-engine --upgrade-appliance".

2) As you can see from https://bugzilla.redhat.com/show_bug.cgi?id=1366879#c35, I did not see the "--upgrade-appliance" option at all while my el7.2 host was running 3.6.9 components. Only after I changed the host's repositories to match 4.0.3 and updated the packages did the option become available. So the host has to be upgraded to 4.0.3 first, after which "hosted-engine --upgrade-appliance" can be run; the appliance must of course also be installed before running it.
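The second point can be sanity-checked before starting. A small pre-flight sketch, assuming only that 4.0-era builds of ovirt-hosted-engine-setup list the new verb in their help output while 3.6 builds do not:

```shell
# Pre-flight check: does this host's ovirt-hosted-engine-setup know about
# --upgrade-appliance at all? On a 3.6 host the verb is simply absent, so
# the host packages must be upgraded to 4.0 first.
if hosted-engine --help 2>&1 | grep -q 'upgrade-appliance'; then
    ready="yes"
else
    ready="no"
fi
echo "host ready for --upgrade-appliance: $ready"
```

On a host still running 3.6.9 components (or one without hosted-engine installed at all) this prints "no", matching the behaviour described in comment #35.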
(In reply to Lucy Bopf from comment #41)
> Thanks for letting me know about the documentation requirements. To clarify
> those two points:

1. The upgrade requires enough free space to hold a copy of the existing engine VM disk. Normally this is not an issue on NFS/Gluster, but it may require manually expanding the LUN used for the hosted-engine storage domain, as Nikolai described. The backup disk is not automatically deleted at the end of the upgrade; it is up to the user to destroy it once they are sure that everything is OK.

2. Yes, the host must be upgraded to 4.0 to gain the new feature.
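The free-space requirement above can be checked up front. A hedged sketch, where SD_MOUNT and the 50G threshold are illustrative placeholders (on a real host, point SD_MOUNT at the hosted-engine storage domain mount under /rhev/data-center/mnt/, and use the size of your engine VM disk as the threshold):

```shell
# Placeholder mount point; override with the real storage domain mount, e.g.
#   SD_MOUNT=/rhev/data-center/mnt/server:_path
SD_MOUNT="${SD_MOUNT:-/}"

# Free space in whole GiB on the storage domain (GNU df).
avail_gib=$(df -BG --output=avail "$SD_MOUNT" | tail -n 1 | tr -dc '0-9')
echo "available on ${SD_MOUNT}: ${avail_gib}G"

# ~50G was the shortfall reported in this bug; your engine disk size may differ.
if [ "$avail_gib" -lt 50 ]; then
    echo "extend the hosted-engine storage domain before running the upgrade"
fi
```

Documenting a check like this would let customers verify space before starting, rather than discovering the shortfall mid-upgrade as happened here.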
Sandro/Simone, in which downstream 4.0 release was this fixed? Can you please link the errata with the correct version?
2.0.1.5, as for bug https://bugzilla.redhat.com/show_bug.cgi?id=1369712 (in the "Blocks" section), with the errata linked there (https://rhn.redhat.com/errata/RHBA-2016-1801.html).