Description of problem:

[ ERROR ] dwhd is currently running. Its hostname is FQDNofmyengine. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running

Version-Release number of selected component (if applicable):

Engine:
ovirt-engine-4.1.0-0.0.master.20161201071307.gita5ff876.el7.centos.noarch
Linux version 3.10.0-327.36.3.el7.x86_64 (builder.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Oct 24 16:09:20 UTC 2016
Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.3.1611 (Core)

Host:
ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch
vdsm-4.18.999-1020.git1ff41b1.el7.centos.x86_64
ovirt-host-deploy-1.6.0-0.0.master.20161107121647.gitfd7ddcd.el7.centos.noarch
libvirt-client-2.0.0-10.el7_3.2.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-release41-pre-4.1.0-0.0.beta.20161201085255.git731841c.el7.centos.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64
ovirt-hosted-engine-ha-2.1.0-0.0.master.20161130135331.20161130135328.git3541725.el7.centos.noarch
ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-setup-lib-1.1.0-0.0.master.20161107100014.gitb73abeb.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.8-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-0.0.master.20161130101611.gitb3ad261.el7.centos.noarch
ovirt-imageio-daemon-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
Linux version 3.10.0-514.2.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Nov 16 13:15:13 EST 2016
Linux 3.10.0-514.2.2.el7.x86_64 #1 SMP Wed Nov 16 13:15:13 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over NFS on a RHEL 7.3 host from ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch.
2. Once deployed, set the host to global maintenance and then, on the engine, run "yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release41-pre.rpm -y && yum clean all && yum update -y --skip-broken".
3. Once finished, run engine-setup on the engine.

Actual results:
engine-setup fails with:
[ ERROR ] dwhd is currently running. Its hostname is FQDNofmyengine. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running

Expected results:
engine-setup should pass without errors.

Additional info:
Created attachment 1231822 [details] boot.log
Created attachment 1231823 [details] engine.log
Created attachment 1231825 [details] ovirt-engine-dwhd.log
Created attachment 1231826 [details] ovirt-engine-setup-20161214183600-ozoyih.log
Created attachment 1231827 [details] ovirt-engine-setup-20161214192609-q08i8b.log
Created attachment 1231828 [details] server.log
Please attach postgresql and yum logs. Shirly - can you please have a look? Thanks.
Created attachment 1232051 [details] yum.log
Created attachment 1232052 [details] postgresql-Wed.log
Same issue, from 4.0.6 here: https://bugzilla.redhat.com/show_bug.cgi?id=1397005.
Restoring needinfo on Shirly
Reproduced again on a clean deployment. I saw that the dwhd service was not running and engine-setup failed with the same error. The workaround is to restart the dwh service.

[ ERROR ] dwhd is currently running. Its hostname is nsednev-he-4.scl.lab.tlv.redhat.com. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running
[ INFO ] Stage: Clean up
         Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20161219145602-4qn6be.log
[ INFO ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20161219145658-setup.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Execution of setup failed

# systemctl status ovirt-engine-dwhd
● ovirt-engine-dwhd.service - oVirt Engine Data Warehouse
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine-dwhd.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2016-12-19 14:56:58 IST; 2min 47s ago
 Main PID: 2359 (code=exited, status=0/SUCCESS)

Dec 19 13:57:56 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Starting oVirt Engine Data Warehouse...
Dec 19 13:57:56 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Started oVirt Engine Data Warehouse.
Dec 19 14:56:55 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Stopping oVirt Engine Data Warehouse...
Dec 19 14:56:58 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Stopped oVirt Engine Data Warehouse.

[root@nsednev-he-4 ~]# systemctl restart ovirt-engine-dwhd
[root@nsednev-he-4 ~]# systemctl status ovirt-engine-dwhd
● ovirt-engine-dwhd.service - oVirt Engine Data Warehouse
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine-dwhd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-12-19 15:00:00 IST; 6s ago
 Main PID: 27431 (ovirt-engine-dw)
   CGroup: /system.slice/ovirt-engine-dwhd.service
           ├─27431 /usr/bin/python /usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.py --re...
           └─27522 ovirt-engine-dwhd -Dorg.ovirt.engine.dwh.settings=/tmp/tmpntZi7P/settings.properties -classpath ...

Dec 19 14:59:59 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Starting oVirt Engine Data Warehouse...
Dec 19 15:00:00 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Started oVirt Engine Data Warehouse.

# engine-setup
.
.
.
.
[ INFO ] Execution of setup completed successfully
The postgresql-server upgrade started at 19:03:49, which caused postgres to shut down at 19:04:32 and produced connection errors for the engine and dwh at that time. engine-setup was run at 19:26:53, the command for dwh to stop was issued at 19:27:09, and the error appeared at 19:27:11, only 2 seconds later. Not sure that is enough. I see in the dwh log that the stop action was received at 19:27:10 and it started, but the log stops there.

Nikolai, does the dwh log have more entries after 19:27:10?

Didi, perhaps we need to increase the time allowed for dwh to stop, so it can close properly.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
I believe that after the postgres restart dwh did reconnect properly, and the problem is that it did not have enough time to close properly during engine-setup.
Didi, Please check why the engine-setup only waited for 2 seconds for dwh to stop. I remember it should be 4 seconds.
(In reply to Shirly Radco from comment #17)
> Didi, Please check why the engine-setup only waited for 2 seconds for dwh to
> stop. I remember it should be 4 seconds.

engine-setup does not wait at all after stopping any service; that is not its job. It is up to each service to make sure that the 'stop' command, in this case 'systemctl stop ovirt-engine-dwhd', returns only after the service has fully stopped. If you suspect that the problem is in the stop process (and not in the reconnect to pg), you can also try debugging this using OVIRT_SERVICE_DEBUG, see bz 1230165.
Nikolai, Please try running this as Didi suggests, using OVIRT_SERVICE_DEBUG
(In reply to Shirly Radco from comment #19)
> Nikolai, Please try running this as Didi suggests, using OVIRT_SERVICE_DEBUG

I will try to reproduce once again, and if successful, will attach the additional required information.
(In reply to Nikolai Sednev from comment #20)
> (In reply to Shirly Radco from comment #19)
> > Nikolai, Please try running this as Didi suggests, using OVIRT_SERVICE_DEBUG
>
> I will try to reproduce once again, and if successful, will attach the
> additional required information.

Hi Nikolai,

I'm closing this bug for the time being. If you manage to reproduce, please re-open with additional details. I'm quite sure we are hit by this from time to time (I kind of remember community reports) but we do not have a root cause yet, so it would be great if it can be reproduced with more information.
I'm reopening, since I believe this is due to dwh not managing to stop in time, and the engine just killed it. It waited for less than 2 seconds. In remote installations we have a mechanism that retries stopping it a few times before reporting that the service was unable to stop. We need to have this in local installations as well, so dwh manages to finish and update the engine database.
I'd like to clarify. In remote installations we have two machines: the engine on machine A and dwh on machine B.

1. When running engine-setup on A, to update the engine, it notifies the remote dwhd to stop by writing a value to its db. The remote dwhd checks this value and, when it is set, exits, also setting another value so that engine-setup on A knows that it exited.

2. When running engine-setup on machine B, to update dwh, it simply stops dwhd.

3. When both are on the same machine, the dwh plugin of engine-setup behaves the same - it just stops dwhd, as in (2.). So when the engine plugin of engine-setup runs later, similarly to (1.), dwhd has already been stopped. If dwhd did not update the database that it exited, for any reason, we see the behaviour of the current bug (and similar ones): engine-setup loops for a minute and then gives up. A rough sketch of this handshake is at the end of this comment.

Generally speaking, the behaviour of (2.) is reasonable - every user expects to be able to do 'service SOME_SERVICE stop' and have it cleanly stopped. If we want to change the behaviour, and make the dwh plugin of engine-setup do the same as the engine plugin does - ask dwhd to exit and wait until it does - we can, but:

A. This is not guaranteed to solve every case. If the problem is that dwhd failed to reconnect to the db, a minute might not be enough, and we did see cases where more than a minute was not enough.

B. This will break if the user manually stopped dwhd.

In addition, we still do not have a 100% successful reproduction flow, so we can't know if the bug is solved.
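Purely for illustration, here is a minimal Python sketch of the handshake described in (1.) above: engine-setup flags the engine db to ask dwhd to stop, then polls for up to a minute for dwhd to report that it exited. The table and variable names (dwh_history_timekeeping, DisconnectDwh, DwhCurrentlyRunning), the connection parameters and the timing constants are assumptions for illustration, not the real engine-setup or dwhd code.

```python
# Illustrative sketch only -- table/variable names, connection parameters
# and timeouts are assumptions, not the real engine-setup or dwhd code.
import time
import psycopg2

POLL_INTERVAL = 1   # seconds between checks
TIMEOUT = 60        # give up after roughly a minute, as described above


def request_dwhd_stop(conn):
    """engine-setup side: ask the (possibly remote) dwhd to exit via the engine db."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE dwh_history_timekeeping SET var_value = %s WHERE var_name = %s",
            ("1", "DisconnectDwh"),        # assumed flag name
        )
    conn.commit()


def wait_for_dwhd_exit(conn):
    """Poll the value dwhd is expected to update on clean exit."""
    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT var_value FROM dwh_history_timekeeping WHERE var_name = %s",
                ("DwhCurrentlyRunning",),  # assumed flag name
            )
            row = cur.fetchone()
        if row and row[0] == "0":
            return True                    # dwhd updated the db before exiting
        time.sleep(POLL_INTERVAL)
    return False                           # the situation in this bug: loop for a minute, give up


# Hypothetical usage:
# conn = psycopg2.connect(dbname="engine", user="engine", host="localhost")
# request_dwhd_stop(conn)
# if not wait_for_dwhd_exit(conn):
#     raise RuntimeError("dwhd is currently running")
```

If dwhd never writes the "I exited" value (for example because it could not reconnect to postgres after the restart), the polling loop above simply times out, which is exactly the symptom reported here.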
Shirly - what do we want to do with this bug? I don't mind trying to fix - but what exactly? Is this a blocker to 4.1 GA?
This is not a blocker for the GA, AFAIU. It does not happen on every upgrade/installation, only in specific cases, and we do have a workaround. I think that in order to avoid this bug in the future we should use the same mechanism as in the remote installation and not rely on when 'service SOME_SERVICE stop' reports that the service stopped. The parent process reports that the service has stopped even though the child dwh daemon is still closing its threads. 'service SOME_SERVICE stop' does stop the process cleanly, but the bug is that the time it reports it stopped is unrelated to when the child process really stopped.
Discussed this with Shirly in private a few days ago.

We agreed to try and change dwhd as follows (a rough sketch is below):
1. In the places that update the engine db before exiting (dwh_history_timekeeping and audit_log), try to reconnect to the engine db prior to trying to update it.
2. Add more logging - before and after (re)connecting to the engine db, etc.

Restoring also the needinfo of comment 19. Nikolai - if you have a reproducer, please try it with OVIRT_SERVICE_DEBUG (and make sure it works with dwhd prior to actually trying to reproduce). Thanks.
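As an illustration only of the proposed change (not the actual dwhd code), a hedged Python sketch of "reconnect to the engine db before updating it on shutdown, with logging around each step". The connection parameters, table and variable names are assumptions.

```python
# Illustrative sketch of "reconnect to the engine db before updating it on
# shutdown, and log each step". Names and connection details are assumptions.
import logging
import time
import psycopg2

logger = logging.getLogger("ovirt-engine-dwhd")


def connect_engine_db(retries=3, delay=5):
    """(Re)connect to the engine database, logging every attempt."""
    last_error = None
    for attempt in range(1, retries + 1):
        logger.info("Connecting to engine db (attempt %d/%d)", attempt, retries)
        try:
            conn = psycopg2.connect(dbname="engine", user="engine", host="localhost")
            logger.info("Connected to engine db")
            return conn
        except psycopg2.OperationalError as exc:
            last_error = exc
            logger.warning("Connection attempt %d failed: %s", attempt, exc)
            time.sleep(delay)
    raise last_error


def mark_dwhd_stopped():
    """On exit, record in the engine db that dwhd is no longer running."""
    conn = connect_engine_db()
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE dwh_history_timekeeping SET var_value = %s WHERE var_name = %s",
                ("0", "DwhCurrentlyRunning"),   # assumed flag name
            )
        conn.commit()
        logger.info("Marked dwhd as stopped in the engine db")
    finally:
        conn.close()
```

The point of the design is that a stale connection left over from the postgres restart is never trusted: the update that tells engine-setup "dwhd exited" is always attempted on a fresh connection, and every failure is logged so a future occurrence leaves evidence in ovirt-engine-dwhd.log.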
(In reply to Yedidyah Bar David from comment #26)
> Discussed this with Shirly in private a few days ago.
>
> We agreed to try and change dwhd as follows:
> 1. In the places that update the engine db before exiting
> (dwh_history_timekeeping and audit_log), try to reconnect to engine db prior
> to trying to update it.
> 2. Add more logging - before and after (re)connecting to engine db, etc.
>
> Restoring also needinfo of comment 19. Nikolai - if you have a reproducer,
> please try it with OVIRT_SERVICE_DEBUG (and make sure it works with dwhd
> prior to actually trying to reproduce). Thanks.

Not being reproduced any more.
I was not able to reproduce this issue with:
ovirt-engine-dwh-setup-4.1.1-1.el7ev.noarch
ovirt-engine-dwh-4.1.1-1.el7ev.noarch

Any chance you were able to reproduce this after the patch? If not, we can move this to VERIFIED, as we don't have proper verification steps, and reopen in case this issue reappears.
I have not seen this being reproduced for a while.
Moving to VERIFIED based on the previous comments.
This bug happened to me personally several times. The "reproducer" was:

1. Set up engine+dwh
2. Wait $RANDOM1
3. Restart dwhd
4. Wait $RANDOM2
5. Restart postgresql
6. engine-setup --offline

I did not spend enough time to pin down accurate, concrete RANDOM1 and RANDOM2 values that trigger it. A rough automation of this flow is sketched below.
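For what it's worth, a rough Python automation of the reproducer above. The sleep bounds for RANDOM1/RANDOM2 are arbitrary guesses (not values known to trigger the race), step 1 is assumed to be done already, and the postgresql service name may differ on a given setup.

```python
# Rough automation of the reproducer above. The RANDOM1/RANDOM2 bounds are
# arbitrary guesses; step 1 (engine+dwh setup) is assumed to be done already.
import random
import subprocess
import time


def run(cmd):
    """Print and run a command, failing loudly if it returns non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 2. Wait $RANDOM1 (bounds are arbitrary)
time.sleep(random.randint(10, 600))

# 3. Restart dwhd
run(["systemctl", "restart", "ovirt-engine-dwhd"])

# 4. Wait $RANDOM2 (bounds are arbitrary)
time.sleep(random.randint(10, 600))

# 5. Restart postgresql (service name assumed to be "postgresql")
run(["systemctl", "restart", "postgresql"])

# 6. Run engine-setup offline and see whether it hits the dwhd error
run(["engine-setup", "--offline"])
```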
*** Bug 1447347 has been marked as a duplicate of this bug. ***