Created attachment 961210 [details]
engine upgrade log
Description of problem:
On my setup, where the engine, dwh and reports each have their own separate server, I got the following error during engine-setup on the engine host when I tried to upgrade from vt9 to vt11:
[ INFO ] Stage: Misc configuration
[ INFO ] Stopping DWH service on host mo-1.rhev.lab.eng.brq.redhat.com...
[ ERROR ] dwhd is currently running. Its hostname is mo-1.rhev.lab.eng.brq.redhat.com. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Misc configuration': dwhd is currently running
[ INFO ] Yum Performing yum transaction rollback
[ INFO ] Stage: Clean up
Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20141124144013-w5uq3h.log
[ INFO ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20141124144140-setup.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Execution of setup failed
When this happened the first time, dwhd was running, so I stopped it manually, re-ran engine-setup, and got the same error.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. install engine on host 'a', dwh on host 'b' and reports on host 'c' - older version
2. change repos to new version and upgrade relevant packages on all hosts:
rhevm-setup on host 'a', rhevm-dwh-setup on host 'b', rhevm-reports-setup on host 'c'
3. run engine-setup on host 'a'
Expected results:
The update should pass.
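The reproduction steps above could be sketched as the following shell commands. This is only a sketch: the package names come from the steps themselves, but the `-y` flag and the exact per-host flow are assumptions.

```shell
# Hypothetical sketch of steps 2-3 on host 'a' (the engine host).
# Assumes the repos were already switched to the new version, per step 2.
upgrade_engine_host() {
  yum update -y rhevm-setup   # step 2: pull the newer setup package
  engine-setup                # step 3: run setup to perform the upgrade
}

# On hosts 'b' and 'c' the corresponding setup packages would be updated instead:
#   yum update -y rhevm-dwh-setup      # host 'b'
#   yum update -y rhevm-reports-setup  # host 'c'
```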
Created attachment 961211 [details]
Did you first upgrade to 3.5 and then move the dwh and reports to a separate host?
If not, please follow the upgrade procedure as documented in:
This was installed as a setup on separate hosts from the beginning, and I'm upgrading from an older build of 3.5 to a newer one in order to test whether it will work for z-stream updates once released.
But you stated in the steps to reproduce that you installed an older version on hosts 'b' and 'c'.
They should have the latest repos.
To test the z-stream upgrade, run engine-setup again after you have installed it successfully the first time.
Not sure what you used as the "older" version, but it might have bugs that were already fixed.
Didi, do you agree?
Michal, how did you stop the dwh service?
I stopped the dwh service with "service ovirt-engine-dwhd stop". That, and "/etc/init.d/ovirt-engine-dwhd stop", are the only correct ways to stop a service on a system that uses SysV init.
By "older version" I mean an older build of 3.5 - I used build vt9.
And yes, I installed it successfully the first time, but on a slightly older build - this should be a correct testing scenario.
If there is some bug in vt9 that makes the update to vt11 fail and was already fixed, please tell me and I will wait for vt12 before verifying https://bugzilla.redhat.com/show_bug.cgi?id=1118322 and https://bugzilla.redhat.com/show_bug.cgi?id=1100205 and testing whether the z-stream update works.
I don't think running engine-setup with the same build is a sufficient test, because setup behaves differently when there are new packages (downloading and installing new RPMs, etc.).
Results of analysis on the machines of the reporter (Thanks, Michal!):
1. If for some reason dwhd loses contact with the engine db, it more or less hangs: it does not exit, nor does it try to reconnect.
Not sure what the root cause was in this specific case. The first error in the log was:
2014-11-04 10:38:10|DswxDN|3L7BLF|f1VhYd|OVIRT_ENGINE_DWH|ConfigurationSync|Default|6|Java Exception|tJDBCOutput_9|org.postgresql.util.PSQLException:FATAL: terminating connection due to administrator command|1
When I later tried to reproduce this by restarting pg on the engine db, I got a different error on the dwh side:
2014-11-26 10:45:11|YtiyXa|YtiyXa|YtiyXa|OVIRT_ENGINE_DWH|HistoryETL|Default|6|Java Exception|tJDBCInput_1|java.lang.NullPointerException:null|1
2014-11-26 10:46:00|dP658j|YtiyXa|3wzY5W|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|6|Java Exception|tJDBCConnection_3|org.postgresql.util.PSQLException:Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.|1
I'll attach all logs.
Since this was the root cause for this bug, I changed the summary accordingly. We opened other bugs for the next steps.
It might be that dwhd does try to reconnect but somehow this does not work well; not sure. Perhaps the best solution would be to simply exit with a message in the log.
2. Running engine-setup on the engine side will try to disconnect dwh by setting DisconnectDwh to 1, waiting a bit, then failing. That's the report in this bug's description. It does not set the flag back to 0 before failing; see bug 1168160 for that.
3. Manually restarting dwhd at that point makes it exit, because it sees that DisconnectDwh is 1. Currently it does not log this; see bug 1168141.
To work around this, reset DisconnectDwh by running on the engine db:
update vdc_options set option_value='0' where option_name = 'DisconnectDwh';
and then restart dwhd.
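Putting the workaround together as a shell sketch. The SQL is the one quoted above; everything else (the database name `engine`, running psql as the `postgres` system user, the SysV `service` command) is an assumption about a typical el6 installation and may differ on other setups.

```shell
# Hypothetical workaround helper: clear the DisconnectDwh flag on the engine db,
# then restart dwhd so it no longer exits on startup.
reset_disconnect_dwh() {
  # Assumes the engine database is named "engine" and is accessible
  # to the "postgres" system user; adjust for your setup.
  su - postgres -c "psql engine -c \"update vdc_options set option_value='0' where option_name = 'DisconnectDwh';\""
  service ovirt-engine-dwhd restart
}
```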
Lowering severity because there is a workaround.
Moving to infra, since it seems to be an issue within the dwhd daemon.
Actually, the workaround in comment 10 is for bug 1168160. For this bug, restarting dwhd should be enough (after fixing the root cause preventing it from accessing the engine's db).
Created attachment 961653 [details]
This bug has missed the release notes cut-off date and will be excluded from the release notes.
RHEV 3.5.0 was released. Closing.
(In reply to Yedidyah Bar David from comment #13)
> Actually the workaround in comment 10 is for bug 1168160. For this bug,
> restarting dwhd should be enough (after fixing the root cause preventing it
> from accessing the engine's db).
I confirm this: I was upgrading from 126.96.36.199-1.el6 to 3.5.4.
engine-setup broke with the error above ([ ERROR ] dwhd is currently running).
I did nothing other than "service ovirt-engine-dwhd restart", then ran engine-setup again.
I did NOT run any SQL query.
engine-setup then completed successfully.
This bug still occurs when attempting to upgrade from ovirt 4.0 to 4.1.
(In reply to Michael Watters from comment #19)
> This bug still occurs when attempting to upgrade from ovirt 4.0 to 4.1.
Can you please open a new bug on oVirt 4.1 and attach log-collector report to it?
Michael opened bug 1447347.