Created attachment 1019696 [details]
logs

Description of problem:

When running engine-setup after restoring the engine with engine-backup (that is: yum install rhevm rhevm-dwh rhevm-reports on a clean OS, then engine-backup --mode=restore ...), engine-setup fails: it tries to stop dwhd, which is - of course - not running, yet the database still flags it as running.

[ INFO ] Cleaning async tasks and compensations
[ INFO ] Checking the Engine database consistency
[ INFO ] Stage: Transaction setup
[ INFO ] Stopping dwh service
[ INFO ] Stopping reports service
[ INFO ] Stopping engine service
[ INFO ] Stopping ovirt-fence-kdump-listener service
[ INFO ] Stopping websocket-proxy service
[ ERROR ] dwhd is currently running. Its hostname is jb-rhevm35.rhev.lab.eng.brq.redhat.com. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running
[ INFO ] Yum Performing yum transaction rollback
[ INFO ] Stage: Clean up
          Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20150428161257-0ekuh0.log
[ INFO ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20150428161325-setup.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Execution of setup failed

From the setup log:

2015-04-28 16:13:25 DEBUG otopi.context context._executeMethod:138 Stage transaction-prepare METHOD otopi.plugins.ovirt_engine_setup.ovirt_engine_dwh.core.single_etl.Plugin._transactionBegin
2015-04-28 16:13:25 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:164 Database: 'None', Statement: 'select * from GetDwhHistoryTimekeepingByVarName( %(name)s )', args: {'name': 'DwhCurrentlyRunning'}
2015-04-28 16:13:25 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:169 Creating own connection
2015-04-28 16:13:25 DEBUG otopi.ovirt_engine_setup.engine_common.database database.execute:214 Result: [{'var_value': '1', 'var_datetime': None, 'var_name': 'DwhCurrentlyRunning'}]
2015-04-28 16:13:25 ERROR otopi.plugins.ovirt_engine_setup.ovirt_engine_dwh.core.single_etl single_etl._transactionBegin:132 dwhd is currently running. Its hostname is jb-rhevm35.rhev.lab.eng.brq.redhat.com. Please stop it before running Setup.
2015-04-28 16:13:25 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine-dwh/core/single_etl.py", line 135, in _transactionBegin
    raise RuntimeError(_('dwhd is currently running'))
RuntimeError: dwhd is currently running
2015-04-28 16:13:25 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Transaction setup': dwhd is currently running

Version-Release number of selected component (if applicable):
rhevm-setup-3.5.1-0.4.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. yum install rhevm rhevm-dwh rhevm-reports on a clean OS
2. engine-setup, then engine-backup --mode=backup ...
3. Have another clean OS (or revert to a snapshot)
4. yum install rhevm rhevm-dwh rhevm-reports on that clean OS
5. Copy the backup file to this clean OS
6. engine-backup --mode=restore ... (have fun creating the DBs/roles)
7. engine-setup

Actual results:
engine-setup fails

Expected results:
Should work nicely

Additional info:
Workaround: /etc/init.d/ovirt-engine-dwhd start, and then engine-setup works.
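Presumably (untested alternative workaround) one could instead clear the stale flag directly, since the debug log shows the check reads 'DwhCurrentlyRunning' = '1' from the database. A hedged one-liner, assuming the default 'engine' database name and that the flag lives in the dwh_history_timekeeping table (named later in this bug) - verify both before running:

    su - postgres -c "psql engine -c \"UPDATE dwh_history_timekeeping SET var_value = '0' WHERE var_name = 'DwhCurrentlyRunning';\""

After that, engine-setup should see the flag cleared and proceed.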
I discovered this issue is discussed at http://www.ovirt.org/Ovirt-engine-backup#DWH.2FReports:

"If dwhd is running during backup, 'engine-setup' after restore will emit the following error and exit:"

I would expect enterprise software to handle this for me - i.e. during backup or restore - rather than tell me to make a manual change in the DB.
(In reply to Jiri Belka from comment #1)
> I discovered this issue is discussed at
> http://www.ovirt.org/Ovirt-engine-backup#DWH.2FReports:
>
> "If dwhd is running during backup, 'engine-setup' after restore will emit
> the following error and exit:"
>
> I would expect enterprise software to handle this for me - i.e. during
> backup or restore - rather than tell me to make a manual change in the DB.

I agree - and therefore fixed this just a few days ago :-)

Note that this is a clone of bug 1158985; that one was closed WONTFIX, so leaving this one open for verification. Not sure it's worth 3.5.z - if someone thinks it is, we can reopen that one. It will not be difficult to backport, but IIRC it's not a mere cherry-pick.
OK, dwhd was not running, and engine-setup finished successfully after a previous restore of the files and the engine, dwh, and reports databases.

rhevm-setup-base-3.6.0.2-0.1.el6.noarch
Why hasn't this been cloned to 3.5? I just hit this issue while upgrading from 3.5.4 to 3.5.5.
(In reply to Jiri Belka from comment #4)
> Why hasn't this been cloned to 3.5? I just hit this issue while upgrading
> from 3.5.4 to 3.5.5.

I guess no-one thought it was important enough. See comment 2 above.
Not sure why people suddenly started getting these errors, as we added this check in 3.5 - see bug 1024028.

Before 3.5, dwh could only be installed on the engine machine, so it was easy to make sure it was down: just kill it. In 3.5, we allow running them on different machines, so things are a bit more complicated. engine-setup, on the engine machine:

1. Kills dwhd at some point, if it's local
2. Asks dwh to stop by setting a db flag, at a different point
3. Checks that a different flag in the db says that dwh is down

dwhd should set the last flag to say it's down when it exits, both if it exits because it was killed and if it exited because it was asked to (by the flag in (2)).

In theory, everything should work. In practice, this complexity introduces more possible failure points. Bug 1286441 is one case. A different case is if dwhd's connection to postgresql is dropped (say, by a network error or kill -9) and dwhd itself is then stopped, so it can't set the flag saying it's stopped. I might be able to come up with even more flows, and you might too, once we understand your flows. In any case, a simple restart of dwhd should be enough if all else is ok.

Some more things we can do:

1. Analyze more cases and try to prevent and/or fix them
2. Decide that if we found and killed a local dwhd, it was the only one, and we do not need to check the flag (this will probably break some less likely flows)
3. Add some opposite check: if e.g. dwh_history_timekeeping.heartBeat is more than X seconds old after we asked it to stop and/or killed it, assume it actually was stopped, even if the flag says it's running

None of these are simple and/or user-friendly and/or fix all problems. A rough sketch of the existing check combined with idea (3) follows below.
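To make the handshake concrete, here is a minimal sketch of the setup-side check plus the heartbeat-staleness idea from point 3. This is not the actual single_etl.py code; using psycopg2, the 'engine' database name, and keeping heartBeat as a var_datetime row in dwh_history_timekeeping are all assumptions to be verified against the installed version.

    import datetime

    import psycopg2

    # The "X seconds" from point 3 above - an arbitrary grace period.
    HEARTBEAT_GRACE = datetime.timedelta(seconds=30)


    def dwh_seems_down(conn):
        # Return True if dwhd can safely be considered stopped.
        cur = conn.cursor()
        cur.execute(
            "SELECT var_name, var_value, var_datetime"
            " FROM dwh_history_timekeeping"
            " WHERE var_name IN ('DwhCurrentlyRunning', 'heartBeat')"
        )
        rows = dict(
            (name, (value, stamp)) for name, value, stamp in cur.fetchall()
        )
        flag = rows.get('DwhCurrentlyRunning', ('0', None))[0]
        if flag != '1':
            # dwhd cleared the flag on exit, as the design intends.
            return True
        # "Opposite check": the flag still says running, but if the
        # heartbeat has not been touched for a while, dwhd most likely
        # died without a chance to clear the flag (kill -9, lost db
        # connection, restored backup taken while dwhd was up, ...).
        heartbeat = rows.get('heartBeat', (None, None))[1]
        if heartbeat is not None:
            # now() with the stamp's tzinfo keeps naive/aware consistent.
            age = datetime.datetime.now(heartbeat.tzinfo) - heartbeat
            return age > HEARTBEAT_GRACE
        return False


    conn = psycopg2.connect(database='engine', user='engine')
    if not dwh_seems_down(conn):
        raise RuntimeError('dwhd is currently running')
    conn.close()

The trade-off is the grace period: too short and a live remote dwhd between heartbeats is declared dead; too long and setup still fails on a freshly restored backup. That is roughly why none of the options above is both simple and safe.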
See also bug 1316973
Created bug report 1317633