Bug 1366879 - --upgrade-appliance - Failed to execute stage 'Environment customization': File contains no section headers. file: <???>, line: 1 u'None'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: 2.0.1.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.0.3
Target Release: 2.0.1.5
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1367732
Blocks: RHEV_4_upgrade_tracker 1368399 1369712 1373052
Reported: 2016-08-14 02:14 UTC by Jiri Belka
Modified: 2019-04-28 13:10 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Hosted-engine upgrade to 4.0 was failing with an unclear error if the answer file on the shared storage was not valid due to other issues. Now it fails with a clear error and the user can recover.
Clone Of:
Clones: 1368399
Environment:
Last Closed: 2016-08-31 09:34:53 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: blocker+
ylavi: planning_ack+
sbonazzo: devel_ack+
pstehlik: testing_ack+


Attachments
Picture of extending the hosted storage via WEBUI (152.70 KB, image/png)
2016-08-30 13:33 UTC, Nikolai Sednev


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1368127 0 unspecified CLOSED [downstream clone - 3.6.9] If ovirt-ha-agent fails to read local answers.conf during upgrade, it writes None to shared f... 2021-02-22 00:41:40 UTC
oVirt gerrit 62538 0 master MERGED answerfile: emit a clear message on parsing errors 2020-06-26 12:39:21 UTC
oVirt gerrit 62702 0 ovirt-hosted-engine-setup-2.0 MERGED answerfile: emit a clear message on parsing errors 2020-06-26 12:39:21 UTC
oVirt gerrit 62703 0 ovirt-hosted-engine-setup-1.3 MERGED answerfile: emit a clear message on parsing errors 2020-06-26 12:39:21 UTC

Internal Links: 1368127

Description Jiri Belka 2016-08-14 02:14:32 UTC
Description of problem:

hosted-engine --upgrade-appliance fails with a strange error, see below. This is a long-running 3.6 production environment of the BRQ RHEV QE team (aka brq-setup).

~~~
...
[ INFO  ] The engine VM is running on this host
[ INFO  ] Stage: Environment customization
[ INFO  ] Answer file successfully loaded
[ ERROR ] Failed to execute stage 'Environment customization': File contains no section headers. file: <???>, line: 1 u'None'
[ INFO  ] Stage: Clean up
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine upgrade failed
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20160814035947-y3efzj.log
You have new mail in /var/spool/mail/root
~~~

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.0.1.4-1.el7ev.noarch

How reproducible:
Hit on our SHE (self-hosted engine) setup.

Steps to Reproduce:
1. Start with a long-running SHE 3.6 environment.
2. Stop the VMs, put all non-HE hosts into maintenance, and set global maintenance.
3. On the HE host where the HE VM is running, with the 4.0 RPMs installed, run: hosted-engine --upgrade-appliance

Actual results:
The upgrade fails; hosted-engine cannot parse the answer file.

Expected results:
The upgrade should work; a clean 3.6 -> 4.0 upgrade has worked fine.

Additional info:

Comment 2 Yedidyah Bar David 2016-08-14 13:29:41 UTC
(In reply to Jiri Belka from comment #0)
> Description of problem:
> 
> hosted-engine --upgrade-appliance fails with strange error, see below. it is
> long running 3.6 prod env of brq rhev qe team (aka brq-setup).
> 

setup log has:

2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile remote_answerfile._fetch_answer_file:69 fetching from: /rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/images/0e961108-e830-440e-a613-7739f5852a0a/c1228161-4822-47f9-945e-3273d9fd445a
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:69 executing: 'sudo -u vdsm dd if=/rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/images/0e961108-e830-440e-a613-7739f5852a0a/c1228161-4822-47f9-945e-3273d9fd445a bs=4k'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:70 executing: 'tar -tvf -'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:88 stdout: -rw-r--r-- 0/0               7 1970-01-01 01:00 version
-rw-r--r-- 0/0               4 1970-01-01 01:00 fhanswers.conf
-rw-r--r-- 0/0            1046 1970-01-01 01:00 hosted-engine.conf
-rw-r--r-- 0/0             182 1970-01-01 01:00 broker.conf
-rw-r--r-- 0/0            1317 1970-01-01 01:00 vm.conf

2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:89 stderr: 
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib.extractConfFile:138 extracting 'fhanswers.conf' from '/rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/images/0e961108-e830-440e-a613-7739f5852a0a/c1228161-4822-47f9-945e-3273d9fd445a'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:69 executing: 'sudo -u vdsm dd if=/rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/images/0e961108-e830-440e-a613-7739f5852a0a/c1228161-4822-47f9-945e-3273d9fd445a bs=4k'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:70 executing: 'tar -xOf - fhanswers.conf'
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:88 stdout: None
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile heconflib._dd_pipe_tar:89 stderr: 
2016-08-14 03:59:52 DEBUG otopi.plugins.gr_he_common.core.remote_answerfile remote_answerfile._fetch_answer_file:82 Answer file form the shared storage: None
2016-08-14 03:59:52 INFO otopi.plugins.gr_he_common.core.remote_answerfile remote_answerfile._fetch_answer_file:85 Answer file successfully loaded
2016-08-14 03:59:52 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-common/core/remote_answerfile.py", line 177, in _customization
    self._parse_answer_file()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-common/core/remote_answerfile.py", line 89, in _parse_answer_file
    self._config.readfp(buf)
  File "/usr/lib64/python2.7/ConfigParser.py", line 324, in readfp
    self._read(fp, filename)
  File "/usr/lib64/python2.7/ConfigParser.py", line 512, in _read
    raise MissingSectionHeaderError(fpname, lineno, line)
MissingSectionHeaderError: File contains no section headers.
file: <???>, line: 1
u'None'
2016-08-14 03:59:52 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Environment customization': File contains no section headers.

fhanswers.conf is 4 bytes long, most likely corrupt/empty/etc.

Please attach setup/HA/vdsm logs and '/etc/ovirt-hosted-engine/answers.conf' from all HE hosts from around the time it was upgraded from 3.5 to 3.6.

Thanks.

Comment 3 Yedidyah Bar David 2016-08-14 13:32:39 UTC
(These 4 bytes are 'None')
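
The parsing failure can be reproduced outside of hosted-engine-setup. Below is a minimal Python 3 sketch (the traceback above comes from Python 2.7's ConfigParser module). The function name `parse_answer_file` and the message wording are hypothetical and only illustrate the idea of a clearer error; this is not the merged patch.

```python
import configparser


def parse_answer_file(content):
    """Parse answer-file text; raise ValueError with a readable message."""
    # interpolation=None: treat '%' in values literally.
    parser = configparser.ConfigParser(interpolation=None)
    # Keep option names exactly as written instead of lower-casing them.
    parser.optionxform = str
    try:
        parser.read_string(content, source="fhanswers.conf")
    except configparser.Error as exc:
        first_line = content.splitlines()[0] if content else ""
        raise ValueError(
            "The answer file on the shared storage is not valid "
            "(first line: %r): %s" % (first_line, exc)
        )
    if not parser.sections():
        raise ValueError(
            "The answer file on the shared storage has no sections"
        )
    return parser
```

Calling `parse_answer_file("None")` raises ValueError with a message that names the shared-storage answer file, instead of the bare MissingSectionHeaderError seen in the setup log.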

Comment 4 Simone Tiraboschi 2016-08-15 07:30:02 UTC
I strongly suspect that adding additional hosts to that setup will also be problematic. The question is how the answer file was lost in the past, when the configuration volume was created.

Comment 5 Jiri Belka 2016-08-15 08:21:02 UTC
(In reply to Yedidyah Bar David from comment #2)
> [...]
> 
> Please attach setup/HA/vdsm logs and '/etc/ovirt-hosted-engine/answers.conf'
> from all HE hosts from around the time it was upgraded from 3.5 to 3.6.

We do not have such a backup. The only backup we have from that time was made via engine-backup, and it does not contain logs or files from the hosts, as it is engine-oriented.

I also do not see any backup files in /etc/ovirt-hosted-engine on hosts.

You can ping me on irc if any other info is needed.

Comment 7 Jiri Belka 2016-08-15 09:25:08 UTC
To clarify for ordinary people :)

[root@slot-2 ~]# file=$(awk -F= '/^conf_volume_UUID/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf)
[root@slot-2 ~]# find /rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup -type f -name "$file" | xargs tar tvf
-rw-r--r-- 0/0               7 1970-01-01 01:00 version
-rw-r--r-- 0/0               4 1970-01-01 01:00 fhanswers.conf
-rw-r--r-- 0/0            1046 1970-01-01 01:00 hosted-engine.conf
-rw-r--r-- 0/0             182 1970-01-01 01:00 broker.conf
-rw-r--r-- 0/0            1317 1970-01-01 01:00 vm.conf


vs a HE host where the answer file exists:

[root@10-34-60-215 ~]# file=$(awk -F= '/^conf_volume_UUID/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf)
[root@10-34-60-215 ~]# find /rhev/data-center/mnt/10.34.63.199:_jbelka_test -type f -name "$file" | xargs tar -tvf
-rw-r--r-- 0/0               7 1970-01-01 01:00 version
-rw-r--r-- 0/0            2454 1970-01-01 01:00 fhanswers.conf
-rw-r--r-- 0/0             739 1970-01-01 01:00 hosted-engine.conf
-rw-r--r-- 0/0             182 1970-01-01 01:00 broker.conf
-rw-r--r-- 0/0            1316 1970-01-01 01:00 vm.conf
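
The same check can be sketched in Python with the standard tarfile module. This is only an illustration on an in-memory archive: `make_archive` and `read_member` are hypothetical helper names, and on a real host the volume would first have to be read as the vdsm user, as in the dd commands from the setup log.

```python
import io
import tarfile


def make_archive(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_member(archive_bytes, name="fhanswers.conf"):
    """Return the raw bytes of one member of a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:") as tar:
        return tar.extractfile(name).read()


# Sample archive mimicking the broken volume: a 4-byte answer file.
broken = make_archive({"version": b"1.3.5.7", "fhanswers.conf": b"None"})
content = read_member(broken)
print(len(content), content)  # prints: 4 b'None'
```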

Comment 8 Jiri Belka 2016-08-15 14:24:56 UTC
I tried to reproduce this with a 3.5 -> 3.6 SHE migration; afterwards fhanswers.conf on the storage domain is identical to /etc/ovirt-hosted-engine/answers.conf, so it has data.

Comment 9 Yedidyah Bar David 2016-08-15 14:41:00 UTC
(In reply to Jiri Belka from comment #8)
> I tried to reproduce 3.5 -> 3.6 SHE migration and fhanswers.conf on storage
> domain is equal to /etc/ovirt-hosted-engine/answers.conf, thus it has data.

Please try this to reproduce:
1. deploy 3.5
2. rm /etc/ovirt-hosted-engine/answers.conf
3. yum update to 3.6

Then check fhanswers.conf on the shared storage after upgrade finishes. You should see this in agent.log:

Upgrading to current version
Saving hosted-engine configuration on the shared storage domain
Configuration file '{path}' not available: {ex}
Successfully moved the configuration to the shared storage

You are then welcome to open a bug on -ha. Not sure exactly what you should write in 'Expected Results', probably that it should fail instead of writing None.

You can try enforcing another upgrade by removing the shared conf volume. This will revert any changes done to engine vm conf, to what's saved in /etc.

Comment 10 Jiri Belka 2016-08-16 10:03:59 UTC
(In reply to Yedidyah Bar David from comment #9)
> (In reply to Jiri Belka from comment #8)
> > I tried to reproduce 3.5 -> 3.6 SHE migration and fhanswers.conf on storage
> > domain is equal to /etc/ovirt-hosted-engine/answers.conf, thus it has data.
> 
> Please try this to reproduce:
> 1. deploy 3.5
> 2. rm /etc/ovirt-hosted-engine/answers.conf
> 3. yum update to 3.6
> 
> Then check fhanswers.conf on the shared storage after upgrade finishes. You
> should see this in agent.log:
> 
> Upgrading to current version
> Saving hosted-engine configuration on the shared storage domain
> Configuration file '{path}' not available: {ex}
> Successfully moved the configuration to the shared storage

It worked fine during my 3.4 -> 3.5 -> 3.6 migration. I see the above lines in agent.log, except "Configuration file '{path}' not available: {ex}".

I see that fhanswers.conf is identical to /etc/ovirt-hosted-engine/answers.conf on the host where these lines appear in agent.log.

On the other host the file differs.

> You are then welcome to open a bug on -ha. Not sure exactly what you should
> write in 'Expected Results', probably that it should fail instead of writing
> None.
> 
> You can try enforcing another upgrade by removing the shared conf volume.
> This will revert any changes done to engine vm conf, to what's saved in /etc.

Please be more specific about how we can "repair" our setup; we need advice on how to work around the current issue.

Comment 11 Yedidyah Bar David 2016-08-16 11:28:21 UTC
(In reply to Jiri Belka from comment #10)
> (In reply to Yedidyah Bar David from comment #9)
> > (In reply to Jiri Belka from comment #8)
> > > I tried to reproduce 3.5 -> 3.6 SHE migration and fhanswers.conf on storage
> > > domain is equal to /etc/ovirt-hosted-engine/answers.conf, thus it has data.
> > 
> > Please try this to reproduce:
> > 1. deploy 3.5
> > 2. rm /etc/ovirt-hosted-engine/answers.conf
> > 3. yum update to 3.6
> > 
> > Then check fhanswers.conf on the shared storage after upgrade finishes. You
> > should see this in agent.log:
> > 
> > Upgrading to current version
> > Saving hosted-engine configuration on the shared storage domain
> > Configuration file '{path}' not available: {ex}
> > Successfully moved the configuration to the shared storage
> 
> It worked fine during my 3.4 -> 3.5 -> 3.6 migration. I see above lines
> except "Configuration file '{path}' not available: {ex}" in agent.log.

Because you didn't try (2.) above.

> 
> I see that fhanswers.conf is equal to /etc/ovirt-hosted-engine/answers.conf
> on the host where these lines on agent.log appear.
> 
> On other host the file differs.
> 
> > You are then welcome to open a bug on -ha. Not sure exactly what you should
> > write in 'Expected Results', probably that it should fail instead of writing
> > None.
> > 
> > You can try enforcing another upgrade by removing the shared conf volume.
> > This will revert any changes done to engine vm conf, to what's saved in /etc.
> 
> Please be more specific how we can "repair" our setup, we need an advice to
> workaround this current issue.

I'd not start trying to repair your setup before we reproduce on a test system to test the workaround.

Comment 12 Jiri Belka 2016-08-16 13:49:25 UTC
(In reply to Yedidyah Bar David from comment #11)
> [...]

# file=$( awk -F= '/^conf_volume/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf )

# domain=$( awk -F= '/^sdUUID/ { print $2 }' /etc/ovirt-hosted-engine/hosted-engine.conf )

# find /rhev/data-center/ -path "*/$domain/*" -type f -name "$file" | xargs -I {} tar Oxf {} version
1.3.5.7[root@dell-r210ii-03 ~]# 

# find /rhev/data-center/ -path "*/$domain/*" -type f -name "$file" | xargs -I {} tar Oxf {} fhanswers.conf
None# 
# egrep "(Upgrading to current|Saving hosted-engine|Configuration file|Successfully moved)" /var/log/ovirt-hosted-engine-ha/agent.log 
MainThread::INFO::2016-08-16 15:15:43,539::upgrade::997::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-08-16 15:16:28,889::upgrade::997::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-08-16 15:16:45,458::upgrade::408::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_create_conf_tar) Saving hosted-engine configuration on the shared storage domain
MainThread::ERROR::2016-08-16 15:16:45,459::upgrade::396::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Configuration file '/etc/ovirt-hosted-engine/answers.conf' not available: [Errno 2] No such file or directory: '/etc/ovirt-hosted-engine/answers.conf'
MainThread::INFO::2016-08-16 15:16:45,526::upgrade::975::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) Successfully moved the configuration to the shared storage

# grep ovirt-hosted /var/log/yum.log 
Aug 16 11:46:57 Installed: ovirt-hosted-engine-ha-1.2.10-1.el7ev.noarch
Aug 16 11:46:58 Installed: ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
Aug 16 15:14:39 Updated: ovirt-hosted-engine-ha-1.3.5.7-1.el7ev.noarch
Aug 16 15:14:40 Updated: ovirt-hosted-engine-setup-1.3.7.2-1.el7ev.noarch
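
The agent.log lines above (an error from _get_conffile_content followed by "Successfully moved the configuration") are consistent with the failure mode sketched below. This is a hypothetical Python 3 reconstruction, not the actual ovirt-hosted-engine-ha code: all function names here are made up for illustration.

```python
def get_conffile_content(path):
    """Hypothetical reader: logs and returns None if the file is unreadable."""
    try:
        with open(path) as f:
            return f.read()
    except IOError as ex:
        print("Configuration file '%s' not available: %s" % (path, ex))
        return None


def naive_payload(content):
    # Formatting None as text silently yields the 4-character string 'None'.
    return str(content).encode("utf-8")


def strict_payload(path, content):
    # Failing early keeps a bogus answer file off the shared storage.
    if content is None:
        raise RuntimeError(
            "%s is missing; refusing to save the configuration" % path
        )
    return content.encode("utf-8")
```

With a missing answers.conf, `naive_payload` produces exactly the 4 bytes found in fhanswers.conf, while `strict_payload` stops the flow with an error.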

Comment 13 Yedidyah Bar David 2016-08-16 14:05:39 UTC
Great. Please attach from host: /etc/ovirt-hosted* /var/log/ovirt-hosted* . Thanks.

Comment 16 Jiri Belka 2016-08-17 08:54:01 UTC
As a consequence of above proposed steps, this is what happened on a non-SPM host with updated rpms to 4.0:

# hosted-engine --upgrade-appliance
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli
[ INFO  ] Stage: Initializing
[ ERROR ] Failed to execute stage 'Initializing': 'Configuration value not found: file=/etc/ovirt-hosted-engine/hosted-engine.conf, key=conf_image_UUID'
[ INFO  ] Stage: Clean up
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20160817105124-um2nww.log


conf_image_UUID really does not exist in /etc/ovirt-hosted-engine/hosted-engine.conf on this host, but it does exist in that file on the other host, which is still on 3.6 and which was the one used for the 3.5 -> 3.6 migration.

Comment 17 Jiri Belka 2016-08-17 09:31:59 UTC
So I updated the RPMs on the other host, the one that was used for the 3.5 -> 3.6 migration and where I deleted /etc/ovirt-hosted-engine/answers.conf as requested, and it ended with the same issue:

...
[ INFO  ] Answer file successfully loaded
[ ERROR ] Failed to execute stage 'Environment customization': File contains no section headers. file: <???>, line: 1 u'None'
[ INFO  ] Stage: Clean up
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine upgrade failed
...

Comment 18 Yedidyah Bar David 2016-08-17 10:32:41 UTC
So we have a reproduction.

I think the following might be a workaround. This will remove the configuration volume from the shared storage and force HA to re-create it.

1. Choose one host to do this on. Make sure its local configuration (in /etc/ovirt-hosted-engine) is up-to-date. You can compare with what's on the shared storage and other hosts.
2. Move all hosts to local maintenance and stop HA services on all of them.
3. Delete the configuration volume. It should be something like:
/rhev/data-center/mnt/*/$sdUUID/images/$conf_image_UUID/$conf_volume_UUID

where sdUUID, conf_image_UUID and conf_volume_UUID are taken from /etc/ovirt-hosted-engine/hosted-engine.conf.

4. Edit /etc/ovirt-hosted-engine/hosted-engine.conf as follows:
Remove lines starting with:
conf_volume_UUID=
conf_image_UUID=
vm_disk_vol_id=

Edit the line:
spUUID=00000000-0000-0000-0000-000000000000
to be:
spUUID=$POOL_UUID
where POOL_UUID is from /rhev/data-center/mnt/didi-lap:_he1/$sdUUID/dom_md/metadata

5. Start HA services on this host.
6. Monitor agent.log to see the upgrade flow as in comment 12, hopefully this time without an error.
7. Compare /etc/ovirt-hosted-engine/hosted-engine.conf with what you had before. The new file should have new IDs for conf_*_UUID and spUUID=00000000-0000-0000-0000-000000000000.
8. Start HA on all other hosts and move all hosts out of local maintenance.

When finished, please attach /etc/ovirt-hosted* /var/log/ovirt-hosted* from all hosts. Setting needinfo also on Simone to review. Thanks.
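
The edit in step 4 could be sketched as a plain text transformation. This is a hypothetical Python helper (`reset_conf_volume_keys` is not part of any oVirt tool), operating on the file content as a string; it assumes one key=value pair per line. Note that the whole procedure was still untested at this point in the thread.

```python
def reset_conf_volume_keys(conf_text, pool_uuid):
    """Drop the conf volume keys and set spUUID, as described in step 4."""
    drop = ("conf_volume_UUID=", "conf_image_UUID=", "vm_disk_vol_id=")
    out = []
    for line in conf_text.splitlines():
        if line.startswith(drop):
            continue  # removed so that HA re-creates the volume
        if line.startswith("spUUID="):
            line = "spUUID=" + pool_uuid
        out.append(line)
    return "\n".join(out) + "\n"
```

The function returns the new content rather than writing it, so the result can be compared with the original file before anything on the host is changed.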

Comment 19 Jiri Belka 2016-08-17 10:40:35 UTC
I created a BZ about checking if content of conf_volume tarball is valid https://bugzilla.redhat.com/show_bug.cgi?id=1367732

Comment 20 Simone Tiraboschi 2016-08-17 13:15:32 UTC
The proposed workaround seems correct and complete, but we have to test it properly.

As far as I understand, this issue can only happen if the initial answer file from setup time was removed from the host before the 3.5 -> 3.6 upgrade.
In that case the upgrade procedure should fail and report the issue; instead it is probably just writing an empty/dumb answer file to the shared storage, which leads to later issues such as this bug.

Comment 21 Jiri Belka 2016-08-17 21:18:54 UTC
(In reply to Yedidyah Bar David from comment #18)

> 4. Edit /etc/ovirt-hosted-engine/hosted-engine.conf as follows:
> Remove lines starting with:
> conf_volume_UUID=
> conf_image_UUID=
> vm_disk_vol_id=
> 
> Edit the line:
> spUUID=00000000-0000-0000-0000-000000000000
> to be:
> spUUID=$POOL_UUID
> where POOL_UUID is from
> /rhev/data-center/mnt/didi-lap:_he1/$sdUUID/dom_md/metadata

I do not see POOL_UUID value in the path which contains $sdUUID.

Any idea?

# grep -E '^(sdUUID|conf_)' /etc/ovirt-hosted-engine/hosted-engine.conf.orig 
sdUUID=990f8f44-a511-4a46-9f8c-468ca9eda05d
conf_volume_UUID=bd8c5cb7-4898-4108-a3f1-773c7c9f4cf5
conf_image_UUID=de2f46cf-5143-4db8-9122-078d2e4cac0d

# find /rhev/data-center/ -path "*/$domain/*" -type f -name 'metadata' | xargs cat
CLASS=Data
DESCRIPTION=hosted_storage
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=ON
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=hosted_datacenter
POOL_DOMAINS=990f8f44-a511-4a46-9f8c-468ca9eda05d:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=
REMOTE_PATH=10.34.63.199:/jbelka/jb-she_test
ROLE=Regular
SDUUID=990f8f44-a511-4a46-9f8c-468ca9eda05d
TYPE=NFS
VERSION=3
_SHA_CKSUM=55080f2b6960048c2d653b846ffad999f4feb825

# find /rhev/data-center/ -type f -name 'metadata' | xargs grep '^POOL_UUID'
/rhev/data-center/mnt/10.34.63.199:_jbelka_jb-she__test/990f8f44-a511-4a46-9f8c-468ca9eda05d/dom_md/metadata:POOL_UUID=
/rhev/data-center/mnt/_var_lib_ovirt-hosted-engine-ha_tmpRqSNVH/8a1780bf-7ae0-46a6-918e-c4f06b44b2b0/dom_md/metadata:POOL_UUID=4ed71ad1-4ad6-4278-be92-a26e67f98f22
/rhev/data-center/mnt/10.34.63.199:_jbelka_jb-she__test-data/d12f4e3a-4637-422c-95f1-5816bdf01f22/dom_md/metadata:POOL_UUID=00000002-0002-0002-0002-0000000001dd
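
For reference, the same lookup as a small Python sketch. `parse_key_values` is a hypothetical helper that assumes simple KEY=VALUE lines, as in the metadata shown above; the sample text is an excerpt of that output.

```python
def parse_key_values(text):
    """Parse simple KEY=VALUE lines into a dict, skipping other lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key] = value
    return values


metadata = """\
POOL_DOMAINS=990f8f44-a511-4a46-9f8c-468ca9eda05d:Active
POOL_UUID=
SDUUID=990f8f44-a511-4a46-9f8c-468ca9eda05d
"""
md = parse_key_values(metadata)
if not md.get("POOL_UUID"):
    print("POOL_UUID is empty for domain", md["SDUUID"])
```

An empty POOL_UUID is what one would expect if the hosted-engine storage domain is no longer attached to a pool.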

Comment 22 Yedidyah Bar David 2016-08-18 05:30:16 UTC
One thing I know for sure: Do not leave
spUUID=00000000-0000-0000-0000-000000000000
because that's the flag for the upgrade process to know that it was done.
Other than that, I'm not sure: perhaps just remove the line, or put some random UUID. Keeping needinfo, perhaps Simone knows better.
I think it will simply fail without doing harm if you do something "wrong", so you can simply try various things.

Comment 23 Simone Tiraboschi 2016-08-18 09:48:43 UTC
The upgrade procedure also tries to detach the hosted-engine storage domain from its bootstrap storage pool to let the engine import it, so the fixing procedure from comment 18 will fail, since our storage domain was already detached.
Reattaching the hosted-engine storage domain to its bootstrap storage pool is far too complex and risky, since the engine already imported it.

An easier procedure is to simply rewrite the existing configuration volume, including the correct answer file.

This script will do the job:
 #!/bin/sh
 # Work in a scratch directory; abort if it cannot be created or entered.
 dir=$(mktemp -d) && cd "$dir" || exit 1
 conf=/etc/ovirt-hosted-engine/hosted-engine.conf
 sdUUID=$(awk -F= '/^sdUUID/ { print $2 }' "$conf")
 conf_volume_UUID=$(awk -F= '/^conf_volume_UUID/ { print $2 }' "$conf")
 conf_image_UUID=$(awk -F= '/^conf_image_UUID/ { print $2 }' "$conf")
 systemctl stop ovirt-ha-broker # on all hosts!
 # Extract the current configuration archive from the shared volume.
 find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'sudo -u vdsm dd if="$1" 2>/dev/null | tar -xvf - 2>/dev/null' {} {} \;
 # Replace the broken answer file with the local one.
 cat /etc/ovirt-hosted-engine/answers.conf > fhanswers.conf # the source file should be sane
 # Write the rebuilt archive back to the shared volume.
 find /rhev/data-center/ -path "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh -c 'tar -cf- * | sudo -u vdsm dd of="$1" 2>/dev/null' {} {} \;
 systemctl restart ovirt-ha-agent # on all hosts!
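
The core of the script (unpack the archive, swap fhanswers.conf, repack) can also be sketched in Python on an in-memory archive. `replace_member` is a hypothetical helper, not part of any oVirt package; it does not touch the shared storage and only shows the archive manipulation.

```python
import io
import tarfile


def replace_member(archive_bytes, name, new_data):
    """Return a new tar archive in which member `name` holds new_data."""
    out = io.BytesIO()
    replaced = False
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src.getmembers():
            if not member.isfile():
                dst.addfile(member)  # directories, links: header only
                continue
            if member.name == name:
                data, replaced = new_data, True
            else:
                data = src.extractfile(member).read()
            member.size = len(data)
            dst.addfile(member, io.BytesIO(data))
    if not replaced:
        raise KeyError("%s not found in archive" % name)
    return out.getvalue()
```

Unlike the shell version, this sketch raises an error if the archive has no fhanswers.conf member, so a wrong volume is noticed before anything is written back.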

Comment 24 Jiri Belka 2016-08-18 11:33:30 UTC
(In reply to Simone Tiraboschi from comment #23)
> The upgrade procedure is also trying to detach the hosted-engine storage
> domain from its bootstrap storage pool to let the engine import it so the
> fixing procedure reported on comment 18 will fail since our storage domain
> was already detached.
> Reattaching the hosted-engine storage domain to its bootstrap storage pool
> is far to complex and risky since the engine already imported it.
> 
> An easiest procedure is to simply re-write the existing configuration volume
> including the correct answer file.
> 
> This script will do the job:
>  #!/bin/sh
>  dir=`mktemp -d` && cd $dir
>  sdUUID=$(awk -F= '/^sdUUID/ { print $2 }'
> /etc/ovirt-hosted-engine/hosted-engine.conf)
>  conf_volume_UUID=$(awk -F= '/^conf_volume_UUID/ { print $2 }'
> /etc/ovirt-hosted-engine/hosted-engine.conf)
>  conf_image_UUID=$(awk -F= '/^conf_image_UUID/ { print $2 }'
> /etc/ovirt-hosted-engine/hosted-engine.conf)
>  systemctl stop ovirt-ha-broker # on all hosts!
>  find /rhev/data-center/ -path
> "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh
> -c 'sudo -u vdsm dd if=$1 2>/dev/null | tar -xvf - 2>/dev/null' {} {} \;
>  cat /etc/ovirt-hosted-engine/answers.conf > fhanswers.conf # the source
> file should be sane
>  find /rhev/data-center/ -path
> "*/${sdUUID}/images/${conf_image_UUID}/${conf_volume_UUID}" -type f -exec sh
> -c 'tar -cf- * | sudo -u vdsm dd of=$1 2>/dev/null' {} {} \;
>  systemctl restart ovirt-ha-agent # on all hosts!

This workaround enabled me to migrate to 4.0 successfully.

Comment 25 Yaniv Lavi 2016-08-18 11:49:01 UTC
Why did you clone to upstream? What is the diff between he bugs?

Comment 26 Simone Tiraboschi 2016-08-18 12:13:25 UTC
1367732 is on ovirt-hosted-engine-ha: refusing to upgrade (3.5->3.6) if the answerfile is missing on the host instead of writing a dumb answerfile on the shared storage

1366879 is on ovirt-hosted-engine-setup: providing a clear error message if the answerfile on the shared storage is not consumable

Comment 27 Yaniv Lavi 2016-08-18 12:19:24 UTC
(In reply to Simone Tiraboschi from comment #26)
> 1367732 is on ovirt-hosted-engine-ha: refusing to upgrade (3.5->3.6) if the
> answerfile is missing on the host instead of writing a dumb answerfile on
> the shared storage
> 
> 1366879 is on ovirt-hosted-engine-setup: providing a clear error message if
> the answerfile on the shared storage is not consumable

Ok, please make sure this goes in to 3.6.9 as well.

Comment 28 Simone Tiraboschi 2016-08-19 10:07:44 UTC
(In reply to Yaniv Dary from comment #27)
> Ok, please make sure this goes in to 3.6.9 as well.

Cloned

Comment 29 Simone Tiraboschi 2016-08-23 10:10:11 UTC
This bug is a side effect of rhbz#1367732.

These two bugs can only occur if the system was initially deployed on 3.5 or earlier: at that point the answer file for hosted-engine setup was at /etc/ovirt-hosted-engine/answers.conf.

The root cause of these two bugs, and the only way we found to reproduce them, is that the user manually deleted the answer file on the host before upgrading to 3.6.

The upgrade procedure from 3.5 -> 3.6 should copy it to the shared storage.
Due to bug 1367732, the 3.5 -> 3.6 upgrade didn't stop if the answer file was missing and simply wrote an answer file containing just 'None' to the shared storage, which could lead to future issues.
The patch for bug 1367732 will prevent this: the upgrade will not be performed until the user restores the answer file on the host.

Future issues caused by the dumb answer file are tracked in this bug: the upgrade of the engine appliance to 4.0 (or adding a new hosted-engine host from the CLI) will fail if ovirt-hosted-engine-setup fails to parse the answer file on the shared storage.
With the proposed patch, hosted-engine-setup simply emits a clearer error message if the answer file on the shared storage is not valid.
It is not going to auto-recover the missing answer file, since we don't have, and cannot guess, all the required info.
The most reasonable recovery action is to manually recover the lost answer file (from a backup or from another host) and copy it to the shared storage.
We proposed a recovery script for that here:
https://bugzilla.redhat.com/show_bug.cgi?id=1366879#c23

Comment 31 Yedidyah Bar David 2016-08-24 05:41:35 UTC
(In reply to Simone Tiraboschi from comment #29)
> The root cause for this two bug, and the only way we found to
> reproduce it, is that the user has manually deleted the answer file on
> the host before upgrading to 3.6.

Another similar flow which we didn't see IIUC but seems possible is that the host that was used for upgrade didn't have the file originally. This can happen if the initial deploy failed at some later stage but before writing this file.

Comment 36 Nikolai Sednev 2016-08-30 13:20:06 UTC
Works for me on these components on host:
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.21.x86_64
sanlock-3.2.4-3.el7_2.x86_64
rhevm-appliance-20160731.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.5-1.el7ev.noarch
mom-0.5.5-1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64
rhev-release-3.6.9-1-001.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
Linux version 3.10.0-327.36.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Wed Aug 17 03:02:37 EDT 2016
Linux 3.10.0-327.36.1.el7.x86_64 #1 SMP Wed Aug 17 03:02:37 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
ovirt-engine-dwh-setup-4.0.2-1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-setup-0.3.0-0.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.3-0.1.el7ev.noarch
ovirt-engine-restapi-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
ovirt-engine-cli-3.6.8.1-1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.8.0-1.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
ovirt-imageio-proxy-0.3.0-0.el7ev.noarch
ovirt-engine-tools-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.0.3-0.1.el7ev.noarch
python-ovirt-engine-sdk4-4.0.0-0.5.a5.el7ev.x86_64
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-engine-dashboard-1.0.3-1.el7ev.x86_64
ovirt-engine-userportal-4.0.3-0.1.el7ev.noarch
ovirt-engine-4.0.3-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.1-1.el7ev.noarch
ovirt-engine-lib-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.3-0.1.el7ev.noarch
ovirt-engine-setup-4.0.3-0.1.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.3-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.4-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.3-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.2-1.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.3-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.3-0.1.el7ev.noarch
ovirt-engine-backend-4.0.3-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhevm-doc-4.0.0-3.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhev-guest-tools-iso-4.0-5.el7ev.noarch
rhevm-4.0.3-0.1.el7ev.noarch
rhevm-branding-rhev-4.0.0-5.el7ev.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.2-1.el7ev.noarch
rhev-release-4.0.3-1-001.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

During hosted-engine --upgrade-appliance I used rhevm-appliance-20160731.0-1.el7ev.noarch, then updated the engine's repos and installed the latest 4.0.3 bits.

Comment 37 Yedidyah Bar David 2016-08-30 13:26:51 UTC
AFAIU, the current bug is about a bad answer file in shared storage; verification should have started from a 3.6 system in that state, and the result should have been a nicer error message. No?
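
To illustrate the failure mode under discussion, here is a minimal sketch, not the actual patch from gerrit 62538. It assumes Python 3's configparser (the 2016 code base used Python 2's ConfigParser). The names AnswerFileError and load_answer_file, and the sample key, are hypothetical and chosen only for illustration. It shows how content consisting of the literal text "None" triggers the "File contains no section headers" parser error from the bug title, and how that error can be wrapped in a clearer message.

```python
import configparser


class AnswerFileError(Exception):
    """Raised when answer-file content cannot be parsed (hypothetical name)."""


def load_answer_file(text, source="<answer file>"):
    """Parse answer-file content; convert parser errors into a clear message."""
    parser = configparser.ConfigParser(interpolation=None)
    # Keep option names case-sensitive instead of lower-casing them.
    parser.optionxform = str
    try:
        parser.read_string(text, source=source)
    except configparser.Error as exc:
        # Show the first line so the user can see what is actually stored.
        first_line = text.splitlines()[0] if text.strip() else ""
        raise AnswerFileError(
            "The answer file {} is not valid (first line: {!r}): {}".format(
                source, first_line, exc
            )
        ) from exc
    return parser


if __name__ == "__main__":
    # Content "None" has no [section] header, so parsing fails.
    try:
        load_answer_file("None", source="shared storage answer file")
    except AnswerFileError as err:
        print(err)
```

With a wrapper of this kind the user sees which file is broken and what it contains, rather than only the raw parser exception.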

Comment 38 Nikolai Sednev 2016-08-30 13:33:03 UTC
Created attachment 1195906 [details]
Picture of extending the hosted storage via WEBUI

Comment 39 Nikolai Sednev 2016-08-30 13:36:42 UTC
Please add these missing steps to documentation:
1) The customer might have to extend the hosted storage, as shown in the attachment, prior to the upgrade.
2) Add that the host's components should be 4.0.3 and not 3.6.9, otherwise the "--upgrade-appliance" functionality won't be available on the host.

Comment 40 Nikolai Sednev 2016-08-30 13:57:48 UTC
(In reply to Yedidyah Bar David from comment #37)
> AFAIU, current bug is about bad answer file in shared storage, and
> verification should have started from a 3.6 system in that state, and result
> should have been a nicer error message. No?

The environment really started from 3.6.9, then it was upgraded to 4.0.3.

Comment 41 Lucy Bopf 2016-08-31 06:37:25 UTC
(In reply to Nikolai Sednev from comment #39)
> Please add these missing steps to documentation:
> 1)Customer might have to extend the hosted-storage as appears within the
> attachment, prior to upgrade.
> 2)Add that host's components should be 4.0.3 and not 3.6.9, otherwise
> "--upgrade-appliance" functionality won't be available on host.

Hi Nikolai,

Thanks for letting me know about the documentation requirements. To clarify those two points:

1) Does this mean that a customer may need to add additional space to their shared self-hosted engine storage? Do we have any idea how much extra space is required? It would be clearer to give a minimum storage value, so a customer could check whether more space is required before they begin the upgrade.

2) Does this mean that 'rhel-7-server-rhv-4.0-rpms' and 'rhel-7-server-rhv-4-mgmt-agent-rpms' must be enabled on the host, and the 'ovirt-hosted-engine-setup' package updated before the user can run 'hosted-engine --upgrade-appliance'? I assume this would apply only to updating RHEL hosts, because new RHVH hosts would have the required package versions already.

Our current documentation is available here: https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/single/self-hosted-engine-guide#Upgrading_the_Self-Hosted_Engine

It may be easier to point to certain places in the current documentation where you think a change is required.

Comment 42 Nikolai Sednev 2016-08-31 06:59:45 UTC
(In reply to Lucy Bopf from comment #41)
> (In reply to Nikolai Sednev from comment #39)
> > Please add these missing steps to documentation:
> > 1)Customer might have to extend the hosted-storage as appears within the
> > attachment, prior to upgrade.
> > 2)Add that host's components should be 4.0.3 and not 3.6.9, otherwise
> > "--upgrade-appliance" functionality won't be available on host.
> 
> Hi Nikolai,
> 
> Thanks for letting me know about the documentation requirements. To clarify
> those two points:
> 
> 1) Does this mean that a customer may need to add additional space to their
> shared self-hosted engine storage? Do we have any idea how much extra space
> is required? It would be clearer to give a minimum storage value, so a
> customer could check whether more space is required before they begin the
> upgrade.
> 
> 2) Does this mean that 'rhel-7-server-rhv-4.0-rpms' and
> 'rhel-7-server-rhv-4-mgmt-agent-rpms' must be enabled on the host, and the
> 'ovirt-hosted-engine-setup' package updated before the user can run
> 'hosted-engine --upgrade-appliance'? I assume this would apply only to
> updating RHEL hosts, because new RHVH hosts would have the required package
> versions already.
> 
> Our current documentation is available here:
> https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/single/
> self-hosted-engine-guide#Upgrading_the_Self-Hosted_Engine
> 
> It may be easier to point to certain places in the current documentation
> where you think a change is required.

1 - Yes. If the hosted-storage iSCSI LUN runs short of space, the user may expand it on their storage appliance, as I did on my XIO storage appliance, and then expand the hosted storage within the engine via the WebUI, as shown in the attachment. In my case I had 75G of initial storage, and once 3.6.9 was running I had only 20G left, while a minimum of an additional 50G was required, as I saw on the shell screen of my failed upgrade. I expanded the LUN to 150G just in case, and then, as shown in the attachment, an additional +75G of storage became available. I clicked on that "button" and then on "Ok", then re-ran the upgrade using the "hosted-engine --upgrade-appliance" command.

2 - As you can probably see from https://bugzilla.redhat.com/show_bug.cgi?id=1366879#c35, I did not see the "--upgrade-appliance" option at all while I was running 3.6.9 components on my el7.2 host. Only after I changed my host's repos to match 4.0.3 and then ran the update could I get the option. In my case the host had to be upgraded to 4.0.3 first; only then could "hosted-engine --upgrade-appliance" be initiated. Of course, you will also need the appliance installed before running "hosted-engine --upgrade-appliance" on your host.

Comment 43 Simone Tiraboschi 2016-08-31 07:29:33 UTC
(In reply to Lucy Bopf from comment #41)
> Thanks for letting me know about the documentation requirements. To clarify
> those two points:

1. It requires enough free space to contain a copy of the existing engine VM disk. Normally this is not an issue on NFS/Gluster, but it may require manually expanding the LUN used for the hosted-engine storage domain, as Nikolai described.

The backup disk will not be deleted automatically at the end of the upgrade; it's up to the user to destroy it once they are sure that everything is OK.

2. Yes, it's required to upgrade the host to 4.0 to gain the new feature.
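
The space requirement in point 1 can be sketched as a simple pre-check. This is only an illustration under stated assumptions, not what hosted-engine-setup actually does: the function names are hypothetical, and shutil.disk_usage only applies to a mounted, file-based storage path (for example NFS). For an iSCSI LUN the free space would have to be read from the storage domain instead.

```python
import shutil

GIB = 1024 ** 3


def missing_space_bytes(free_bytes, engine_disk_bytes):
    """Bytes still needed to hold a copy of the engine VM disk (0 if enough)."""
    return max(0, engine_disk_bytes - free_bytes)


def free_bytes_on_mount(path):
    """Free bytes on a mounted, file-based storage path (assumption: NFS-like)."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # Figures from comment 42, read here as 20G free against a 50G requirement.
    shortfall = missing_space_bytes(20 * GIB, 50 * GIB)
    print("Shortfall before extending the LUN: %d GiB" % (shortfall // GIB))
    # After the extra 75G was added, 95G would be free.
    shortfall = missing_space_bytes(95 * GIB, 50 * GIB)
    print("Shortfall after extending the LUN: %d GiB" % (shortfall // GIB))
```

Since the backup disk is kept after the upgrade, the copied disk keeps occupying this space until the user removes it.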

Comment 44 Marina Kalinin 2016-10-28 18:54:54 UTC
Sandro/Simone, in which d/s release was this fixed in 4.0?
Can you please link errata with the correct version?

Comment 45 Simone Tiraboschi 2016-10-28 22:26:58 UTC
2.0.1.5, as per bug https://bugzilla.redhat.com/show_bug.cgi?id=1369712 (in the "Blocks" section), with the errata linked there (https://rhn.redhat.com/errata/RHBA-2016-1801.html)

