Bug 1216125 - after engine-backup --mode=restore, engine-setup fails with: dwhd is currently running
Summary: after engine-backup --mode=restore, engine-setup fails with: dwhd is currently running
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Yedidyah Bar David
QA Contact: Jiri Belka
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-04-28 14:26 UTC by Jiri Belka
Modified: 2021-08-30 11:45 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: dwhd records whether it is up by setting a flag in the engine database; engine-backup kept this flag in the backup and restored it on restore.
Consequence: engine-setup, run after an engine-backup restore, saw the flag set, concluded that dwhd was "up", and failed.
Fix: engine-backup --mode=restore was changed to clear this flag.
Result: A sequence of engine-backup --mode=backup with dwhd up, followed by engine-backup --mode=restore and then engine-setup, no longer fails.
Note to doc team: IIRC this is already mentioned somewhere in the docs. Please check and remove that content if needed.
Clone Of:
Environment:
Last Closed: 2016-03-11 07:32:09 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:


Attachments
logs (40.72 KB, application/x-gzip), 2015-04-28 14:26 UTC, Jiri Belka


Links
Red Hat Issue Tracker RHV-43179 (last updated 2021-08-30 11:45:54 UTC)
Red Hat Knowledge Base (Solution) 1283133 (last updated 2016-09-02 20:22:40 UTC)
oVirt gerrit 40348

Description Jiri Belka 2015-04-28 14:26:39 UTC
Created attachment 1019696
logs

Description of problem:

When running engine-setup after restoring the engine with engine-backup (that is: yum install rhevm rhevm-dwh rhevm-reports on a clean OS, then engine-backup --mode=restore ...), engine-setup fails: it claims that dwhd is currently running and asks to stop it before Setup, although dwhd is of course not running on the freshly restored machine.

[ INFO  ] Cleaning async tasks and compensations
[ INFO  ] Checking the Engine database consistency
[ INFO  ] Stage: Transaction setup
[ INFO  ] Stopping dwh service
[ INFO  ] Stopping reports service
[ INFO  ] Stopping engine service
[ INFO  ] Stopping ovirt-fence-kdump-listener service
[ INFO  ] Stopping websocket-proxy service
[ ERROR ] dwhd is currently running. Its hostname is jb-rhevm35.rhev.lab.eng.brq.redhat.com. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running
[ INFO  ] Yum Performing yum transaction rollback
[ INFO  ] Stage: Clean up
          Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20150428161257-0ekuh0.log
[ INFO  ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20150428161325-setup.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Execution of setup failed

2015-04-28 16:13:25 DEBUG otopi.context context._executeMethod:138 Stage transaction-prepare METHOD otopi.plugins.ovirt_engine_setup.ovirt_engine_dwh.core.single_etl.Plugin._transaction
Begin
2015-04-28 16:13:25 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:164 Database: 'None', Statement: '
            select * from GetDwhHistoryTimekeepingByVarName(
                %(name)s
            )
        ', args: {'name': 'DwhCurrentlyRunning'}
2015-04-28 16:13:25 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:169 Creating own connection
2015-04-28 16:13:25 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:214 Result: [{'var_value': '1', 'var_datetime': None, 'var_name': 'DwhCurrentlyRunning'}]
2015-04-28 16:13:25 ERROR otopi.plugins.ovirt_engine_setup.ovirt_engine_dwh.core.single_etl single_etl._transactionBegin:132 dwhd is currently running.
Its hostname is jb-rhevm35.rhev.lab.eng.brq.redhat.com.
Please stop it before running Setup.
2015-04-28 16:13:25 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine-dwh/core/single_etl.py", line 135, in _transactionBegin
    raise RuntimeError(_('dwhd is currently running'))
RuntimeError: dwhd is currently running
2015-04-28 16:13:25 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Transaction setup': dwhd is currently running


Version-Release number of selected component (if applicable):
rhevm-setup-3.5.1-0.4.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. yum install rhevm rhevm-dwh rhevm-reports on clean OS
2. engine-setup + engine-backup --mode=backup ...
3. Have another clean OS (or revert to a clean snapshot)
4. yum install rhevm rhevm-dwh rhevm-reports on clean OS
5. copy backup file on this clean OS
6. engine-backup --mode=restore ... (have fun when creating DBs/roles)
7. engine-setup
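
For reference, a rough sketch of the commands behind steps 2 and 6-7 (file and log names are placeholders; as noted in step 6, the restore in 3.5 still requires creating the DBs/roles by hand):

  # on the original machine (step 2)
  engine-backup --mode=backup --file=engine-backup.tar.bz2 --log=engine-backup.log
  # on the clean machine, after copying the backup file over (steps 6-7)
  engine-backup --mode=restore --file=engine-backup.tar.bz2 --log=engine-restore.log
  engine-setup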

Actual results:
engine-setup fails

Expected results:
engine-setup should complete successfully

Additional info:
Workaround: start dwhd first (/etc/init.d/ovirt-engine-dwhd start); engine-setup then works.

Comment 1 Jiri Belka 2015-04-30 10:23:56 UTC
I discovered this issue is discussed at http://www.ovirt.org/Ovirt-engine-backup#DWH.2FReports

"If dwhd is running during backup, 'engine-setup' after restore will emit the following error and exit: "

I would expect enterprise software to do this for me, i.e. during backup or restore, and not tell me to make a manual change in the DB.
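
For the record, the manual change the wiki refers to amounts to clearing that flag in the database. A minimal sketch, assuming the engine database is named 'engine' (adjust to your setup) and the value convention visible in the setup log above (var_value '1' means dwhd is considered running; '0' as the "not running" value is an assumption):

  # clear the stale "dwhd is currently running" marker left over from the backup
  su - postgres -c "psql engine -c \"UPDATE dwh_history_timekeeping SET var_value = '0' WHERE var_name = 'DwhCurrentlyRunning';\""

Per the Doc Text above, this is essentially the step that the fixed engine-backup --mode=restore now performs automatically.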

Comment 2 Yedidyah Bar David 2015-05-03 05:37:09 UTC
(In reply to Jiri Belka from comment #1)
> I discovered this issue is discussed at
> http://www.ovirt.org/Ovirt-engine-backup#DWH.2FReports
> 
> "If dwhd is running during backup, 'engine-setup' after restore will emit
> the following error and exit: "
> 
> I would expect enterprise software to do this for me, i.e. during backup
> or restore, and not tell me to make a manual change in the DB.

I agree - therefore fixed this just a few days ago :-)

Note that this is a clone of bug 1158985, which was closed WONTFIX. Leaving this one open for verification.

Not sure it's worth 3.5.z; if someone thinks it is, we can reopen that one. It will not be difficult to backport, but IIRC it's not a mere cherry-pick.

Comment 3 Jiri Belka 2015-10-30 13:25:29 UTC
OK - dwhd was not running, and engine-setup finished successfully after the earlier restore of files, engine, dwh, and reports.

rhevm-setup-base-3.6.0.2-0.1.el6.noarch

Comment 4 Jiri Belka 2015-10-31 00:19:08 UTC
Why hasn't this been cloned to 3.5? I just hit this issue while upgrading from 3.5.4 to 3.5.5.

Comment 5 Yedidyah Bar David 2015-11-01 10:24:24 UTC
(In reply to Jiri Belka from comment #4)
> Why hasn't this been cloned to 3.5? I just hit this issue while upgrading
> from 3.5.4 to 3.5.5.

I guess no one thought it was important enough. See comment 2 above.

Comment 10 Yedidyah Bar David 2016-03-13 07:53:18 UTC
Not sure why people have suddenly started getting these errors, as we added this check in 3.5; see bug 1024028.

Before 3.5, dwh could only be installed on the engine machine, so it was easy to make sure it was down: just kill it.

In 3.5, we allow running them on different machines. So things are a bit more complicated. engine-setup, on the engine machine:
1. Kills dwhd at some point if it's local
2. Asks dwh to stop by setting a db flag, at a different point
3. Checks that a different flag in the db says that dwh is down

dwhd should set the last flag, marking itself as down, when it exits - both when it exits because it was killed and when it exits because it was asked to stop (via the flag in step 2).
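
The check in step 3 is the one visible in the setup log above (GetDwhHistoryTimekeepingByVarName for 'DwhCurrentlyRunning'). A minimal way to inspect that flag by hand, assuming the engine database is named 'engine':

  # var_value '1' here is what makes engine-setup report "dwhd is currently running"
  su - postgres -c "psql engine -c \"SELECT var_name, var_value, var_datetime FROM dwh_history_timekeeping WHERE var_name = 'DwhCurrentlyRunning';\""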

In theory, everything should work.

In practice, this complexity introduces more possible failure points. bug 1286441 is one case. A different case is if dwhd's connection to postgresql is dropped (say, network error, or kill -9), and then dwhd itself is stopped (so it can't set the flag saying it's stopped). I might be able to come up with even more flows, and you might too, once we understand your flows.

In any case, a simple restart of dwhd should be enough if all else is ok.

Some more things we can do:
1. Analyze more cases and try to prevent and/or fix them
2. Decide that if we found and killed a local dwhd, it was the only one, and we do not need to check the flag (this will probably break some less likely flows)
3. Add an opposite check: if, e.g., dwh_history_timekeeping.heartBeat is more than X seconds old after we asked dwhd to stop and/or killed it, assume it actually was stopped, even if the flag says it's running (a rough sketch follows below).

None of these are simple and/or user-friendly and/or fix all problems.
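
As a very rough illustration of option 3: assuming the heartbeat is kept as a var_name = 'heartBeat' row in dwh_history_timekeeping, with var_datetime holding its last update time (an assumption based on the table layout shown in the log, not actual engine-setup code), such a check could look like:

  # treat dwhd as effectively stopped if its heartbeat is older than some
  # threshold (60 seconds here), even if DwhCurrentlyRunning still says '1'
  su - postgres -c "psql engine -c \"SELECT (now() - var_datetime) > interval '60 seconds' AS dwh_looks_stopped FROM dwh_history_timekeeping WHERE var_name = 'heartBeat';\""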

Comment 11 Yedidyah Bar David 2016-03-13 08:03:43 UTC
See also bug 1316973

Comment 12 Jason Woods 2016-03-14 17:40:25 UTC
Created bug report 1317633

