Bug 1416050
| Summary: | [downstream clone - 4.0.7] engine-setup refuses to run over a DB restored from a hosted-engine env if it wasn't in global maintenance mode at backup time | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | rhev-integ |
| Component: | ovirt-engine | Assignee: | Simone Tiraboschi <stirabos> |
| Status: | CLOSED ERRATA | QA Contact: | Artyom <alukiano> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | unspecified | CC: | alukiano, bgraveno, bugs, dfediuck, didi, lsurette, mavital, mgoldboi, nsednev, rbalakri, Rhev-m-bugs, srevivo, stirabos, ykaul |
| Target Milestone: | ovirt-4.0.7 | Keywords: | Regression, Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | This update fixes an issue where engine-setup would not run over a restored database if the backup was taken from a hosted-engine environment that was not in global maintenance mode. | | |
| Story Points: | --- | | |
| Clone Of: | 1403903 | Environment: | |
| Last Closed: | 2017-03-16 15:31:18 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1403903 | | |
| Bug Blocks: | | | |
Description
rhev-integ
2017-01-24 13:41:57 UTC
Created attachment 1230829 [details]
ovirt-hosted-engine-setup-20161212162707-5ujgti.log
(Originally by Nikolai Sednev)
Adding sosreport from the host here: https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88NDByaFhiZ3NwMW8/view?usp=sharing

(Originally by Nikolai Sednev)

1. What's the difference between this bz and bug 1399053?
2. Which upgrade documentation are you following?

(Originally by Doron Fediuck)

(In reply to Doron Fediuck from comment #3)
> 1. What's the difference between this bz and bug 1399053?
> 2. Which upgrade documentation are you following?

1. I was trying to verify bug 1399053, but since I failed, I opened this bug.
2. I'm following the same steps for the engine DB backup that worked before, i.e. making a backup of the engine DB, copying it to the host on which I'm trying to upgrade, and then, during the engine upgrade, providing the path to the backup files on the host.

(Originally by Nikolai Sednev)

(In reply to Nikolai Sednev from comment #4)
> 1. I was trying to verify bug 1399053, but since I failed, I opened this bug.

This is indeed a different issue. The root cause here is:

2016-12-12 16:27:07 DEBUG otopi.plugins.otopi.packagers.dnfpackager dnfpackager._boot:163 Cannot initialize minidnf
Traceback (most recent call last):
  File "/usr/share/otopi/plugins/otopi/packagers/dnfpackager.py", line 150, in _boot
    constants.PackEnv.DNF_DISABLED_PLUGINS
  File "/usr/share/otopi/plugins/otopi/packagers/dnfpackager.py", line 60, in _getMiniDNF
    from otopi import minidnf
  File "/usr/lib/python2.7/site-packages/otopi/minidnf.py", line 16, in <module>
    import dnf
ImportError: No module named dnf

> 2. I'm following the same steps for the engine DB backup that worked before [...]

We're now ensuring upgrade flows use the supported documentation, since there are multiple environments (el6/el7, HE/non-HE, 3.5/3.6). So unless this is a standard backup and restore (to the same version), please use only the relevant upgrade flow.

(Originally by Doron Fediuck)

(In reply to Doron Fediuck from comment #5)
> This is indeed a different issue. The root cause here is:
> [traceback quoted above]
> So unless this is a standard backup and restore (to the same version), please use only the relevant upgrade flow.

2. In this case I tried backing up and restoring the same DB on the same appliance, just to see whether the /root issue was fixed, before going further and running the whole upgrade flow.

(Originally by Nikolai Sednev)

(In reply to Doron Fediuck from comment #5)
> This is indeed a different issue. The root cause here is:
> [traceback quoted above, ending in "ImportError: No module named dnf"]

This is not critical. On RHEL and CentOS there is no dnf, so yum is used as a fallback.

(Originally by Sandro Bonazzola)
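The dnf-to-yum fallback Sandro describes is the usual optional-import pattern. A minimal sketch of the idea (illustrative only, not otopi's actual code; the surrounding variable names are assumptions):

```python
# Minimal sketch of the packager fallback Sandro describes: prefer dnf
# where it exists (Fedora), fall back to yum (RHEL/CentOS 7).
# Not otopi's actual implementation.
try:
    import dnf  # missing on RHEL/CentOS 7, raising the ImportError above
    PACKAGER = 'dnf'
except ImportError:
    import yum  # the packager that is actually present there
    PACKAGER = 'yum'

print('Using packager: %s' % PACKAGER)
```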
As per the instructions provided on the failure:

RuntimeError: Engine setup failed on the appliance. Please check its log on the appliance.

Can you please attach the engine-setup logs from within the appliance?

(Originally by Sandro Bonazzola)

Created attachment 1231169 [details]
ovirt-engine-setup-20161212190249-4t6x44.log
(Originally by Nikolai Sednev)
The real issue is here:

2016-12-12 19:02:49 DEBUG otopi.ovirt_engine_setup.engine_common.database database.getCredentials:164 dbenv: {'OVESETUP_DWH_DB/database': 'ovirt_engine_history', 'OVESETUP_DWH_DB/host': 'localhost', 'OVESETUP_DWH_DB/port': 5432, 'OVESETUP_DWH_DB/securedHostValidation': False, 'OVESETUP_DWH_DB/secured': False, 'OVESETUP_DWH_DB/password': '1', 'OVESETUP_DWH_DB/user': 'ovirt_engine_history'}
2016-12-12 19:02:49 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:177 Database: 'None', Statement: 'select count(*) as count from pg_catalog.pg_tables where schemaname = 'public';', args: {}
2016-12-12 19:02:49 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:182 Creating own connection
2016-12-12 19:02:49 DEBUG otopi.ovirt_engine_setup.engine_common.database database.getCredentials:189 database connection failed
Traceback (most recent call last):
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 187, in getCredentials
    ] = self.isNewDatabase()
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 370, in isNewDatabase
    transaction=False,
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 191, in execute
    database=database,
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 125, in connect
    sslmode=sslmode,
  File "/usr/lib64/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
OperationalError: could not connect to server: Connection refused
    Is the server running on host "localhost" (::1) and accepting TCP/IP connections on port 5432?
could not connect to server: Connection refused
    Is the server running on host "localhost" (127.0.0.1) and accepting TCP/IP connections on port 5432?

(Originally by Simone Tiraboschi)

Nikolay, you uploaded the wrong engine-setup log file. The real issue was here, in the hosted-engine-setup logs:

2016-12-12 16:57:16 DEBUG otopi.plugins.otopi.dialog.human dialog.__logString:204 DIALOG:SEND |- [ ERROR ] It seems that you are running your engine inside of the hosted-engine VM and are not in "Global Maintenance" mode. In that case you should put the system into the "Global Maintenance" mode before running engine-setup, or the hosted-engine HA agent might kill the machine, which might corrupt your data.

hosted-engine-setup checked for global maintenance mode and it was fine:

2016-12-12 16:27:11 INFO otopi.plugins.gr_he_common.vm.misc misc._late_setup:65 Checking maintenance mode
2016-12-12 16:27:11 DEBUG otopi.plugins.gr_he_common.vm.misc misc._late_setup:68 hosted-engine-status: {'engine_vm_up': True, 'all_host_stats': {1: {'live-data': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=99379 (Mon Dec 12 16:26:40 2016)\nhost-id=1\nscore=3400\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n', 'hostname': 'alma04.qa.lab.tlv.redhat.com', 'host-id': 1, 'engine-status': '{"health": "good", "vm": "up", "detail": "up"}', 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': 'c315b57c', 'host-ts': 99379}}, 'engine_vm_host': 'alma04.qa.lab.tlv.redhat.com', 'global_maintenance': True}

The point is that engine-setup checks for global maintenance mode in the restored DB, not on the host. So if you really took the DB when hosted-engine-setup asked you to create it, you are fine.

Nikolay, were you trying to restore a previous DB taken outside global maintenance mode? The open gap is that engine-backup is not able to restore a DB taken on a hosted-engine env that was not in maintenance mode at backup time.

(Originally by Simone Tiraboschi)
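To make Simone's point concrete: engine-setup first probes the restored DB (the isNewDatabase() check in the traceback above) and then reads the HA maintenance state out of that DB rather than asking the host. A minimal sketch of both steps, assuming for illustration that the flag lives in a vds_statistics.ha_global_maintenance column (the exact table and credentials are assumptions, not confirmed by this report):

```python
# Minimal sketch, not ovirt-engine-setup code: probe the restored engine DB
# as database.py's isNewDatabase() does, then read the HA maintenance flag
# from the DB itself. The vds_statistics.ha_global_maintenance column is an
# assumed location of the flag, used here only for illustration.
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    port=5432,
    user='engine',
    password='engine-db-password',  # placeholder credentials
    database='engine',
)
cur = conn.cursor()

# The connectivity probe from the traceback above; this is what fails with
# "Connection refused" when PostgreSQL is not up on localhost:5432.
cur.execute(
    "select count(*) as count from pg_catalog.pg_tables "
    "where schemaname = 'public';"
)
print('tables in public schema: %d' % cur.fetchone()[0])

# Read the maintenance flag recorded in the restored DB: if the backup was
# taken outside global maintenance, this is what engine-setup trips over,
# regardless of the host's current state.
cur.execute('select ha_global_maintenance from vds_statistics;')
for (in_maintenance,) in cur.fetchall():
    print('global maintenance recorded in DB: %s' % in_maintenance)

conn.close()
```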
I manually created the DB backup on the engine, using "engine-backup --mode=backup --file=nsednev1 --log=Log_1". I took the DB backup from the engine and copied it to a non-root folder on the host while the host was not in global maintenance; then I set the host to global maintenance. So when I was trying to upgrade, the DB contained information about a host that was not in global maintenance.

(Originally by Nikolai Sednev)

If the user follows the instructions and creates the backup when asked by hosted-engine-setup, everything will be fine.

(Originally by Simone Tiraboschi)

(In reply to Simone Tiraboschi from comment #13)
> If the user follows the instructions and creates the backup when asked by hosted-engine-setup, everything will be fine.

That's true for migration. For backup/restore, we do not require maintenance during backup. Perhaps we should; so far we have tried not to.

(Originally by didi)

IMHO the docs at https://access.redhat.com/labs/rhevupgradehelper/ should be updated: the first step there is "Step 1: Stop the ovirt engine service.", while "Step 4: Disable the high-availability agents on all the self-hosted engine hosts. To do this run the following command on any host in the cluster." should come first.

(Originally by Nikolai Sednev)

Also, after the engine's DB has been copied to the host, the engine service should be started again, otherwise "hosted-engine --upgrade-appliance" will fail, as the service must be up on the engine during the upgrade.

(Originally by Nikolai Sednev)

Verified on: rhevm-4.0.7.4-0.1.el7ev.noarch

# rpm -qa | grep hosted
ovirt-hosted-engine-setup-2.0.4.3-3.el7ev.noarch
ovirt-hosted-engine-ha-2.0.7-2.el7ev.noarch

1. Deploy the HE environment.
2. Add the storage domain to the engine (to start the auto-import process).
3. Wait until the engine has the HE VM.
4. Back up the engine: # engine-backup --mode=backup --file=engine.backup --log=engine-backup.log
5. Copy the backup file from the HE VM to the host.
6. Clean the host from the HE deployment (reprovisioning).
7. Run the HE deployment again.
8. Answer No to the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]?".
9. Enter the HE VM and copy the backup file from the host to the HE VM.
10. Run the restore command: # engine-backup --mode=restore --scope=all --file=engine.backup --log=engine-restore.log --he-remove-storage-vm --he-remove-hosts --restore-permissions --provision-dwh-db --provision-db
11. Run engine setup: # engine-setup --offline
12. Finish the HE deployment process.

The engine is up and has the HE SD and HE VM in the active state. (A condensed sketch of this flow is shown below.)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html
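As referenced in the verification steps above, here is a condensed sketch of the commands run on the restored appliance, driven via Python's subprocess. The commands and flags are taken verbatim from the verification; the surrounding script is an illustration and assumes it runs inside the HE VM with the backup file already in place:

```python
# Condensed sketch of the verified restore flow above. Illustrative only:
# it assumes it runs inside the HE VM with engine.backup already copied
# over, and that the surrounding HE deployment steps were already done.
import subprocess

def run(cmd):
    print('running: %s' % ' '.join(cmd))
    subprocess.check_call(cmd)

# Step 10: restore the backup on the fresh appliance.
run(['engine-backup', '--mode=restore', '--scope=all',
     '--file=engine.backup', '--log=engine-restore.log',
     '--he-remove-storage-vm', '--he-remove-hosts',
     '--restore-permissions', '--provision-dwh-db', '--provision-db'])

# Step 11: run engine-setup offline over the restored DB. With this fix it
# no longer refuses to run when the backup was taken outside global
# maintenance mode.
run(['engine-setup', '--offline'])
```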