+++ This bug is an upstream to downstream clone. The original bug is: +++
+++   bug 1404812 +++
======================================================================

Description of problem:
[ ERROR ] dwhd is currently running. Its hostname is FQDNofmyengine. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running

Version-Release number of selected component (if applicable):

Engine:
ovirt-engine-4.1.0-0.0.master.20161201071307.gita5ff876.el7.centos.noarch
Linux version 3.10.0-327.36.3.el7.x86_64 (builder.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Oct 24 16:09:20 UTC 2016
Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.3.1611 (Core)

Host:
ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch
vdsm-4.18.999-1020.git1ff41b1.el7.centos.x86_64
ovirt-host-deploy-1.6.0-0.0.master.20161107121647.gitfd7ddcd.el7.centos.noarch
libvirt-client-2.0.0-10.el7_3.2.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-release41-pre-4.1.0-0.0.beta.20161201085255.git731841c.el7.centos.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64
ovirt-hosted-engine-ha-2.1.0-0.0.master.20161130135331.20161130135328.git3541725.el7.centos.noarch
ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-setup-lib-1.1.0-0.0.master.20161107100014.gitb73abeb.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.8-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-0.0.master.20161130101611.gitb3ad261.el7.centos.noarch
ovirt-imageio-daemon-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
Linux version 3.10.0-514.2.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Nov 16 13:15:13 EST 2016
Linux 3.10.0-514.2.2.el7.x86_64 #1 SMP Wed Nov 16 13:15:13 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over NFS on a RHEL 7.3 host from ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch.
2. Once deployed, set the host to global maintenance, then on the engine run "yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release41-pre.rpm -y && yum clean all && yum update -y --skip-broken".
3. Once finished, run engine-setup on the engine.

Actual results:
engine-setup fails with:
[ ERROR ] dwhd is currently running. Its hostname is FQDNofmyengine. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running

Expected results:
engine-setup should pass without errors.

Additional info:

(Originally by Nikolai Sednev)
Created attachment 1231822 [details] boot.log (Originally by Nikolai Sednev)
Created attachment 1231823 [details] engine.log (Originally by Nikolai Sednev)
Created attachment 1231825 [details] ovirt-engine-dwhd.log (Originally by Nikolai Sednev)
Created attachment 1231826 [details] ovirt-engine-setup-20161214183600-ozoyih.log (Originally by Nikolai Sednev)
Created attachment 1231827 [details] ovirt-engine-setup-20161214192609-q08i8b.log (Originally by Nikolai Sednev)
Created attachment 1231828 [details] server.log (Originally by Nikolai Sednev)
Please attach postgresql and yum logs. Shirly - can you please have a look? Thanks. (Originally by didi)
Created attachment 1232051 [details] yum.log (Originally by Nikolai Sednev)
Created attachment 1232052 [details] postgresql-Wed.log (Originally by Nikolai Sednev)
Same issue, from 4.0.6 here: https://bugzilla.redhat.com/show_bug.cgi?id=1397005. (Originally by Nikolai Sednev)
Restoring needinfo on Shirly (Originally by Sandro Bonazzola)
Reproduced again on a clean deployment. I saw that the dwhd service was not running, yet engine-setup failed with the same error. The workaround is to restart the dwh service.

[ ERROR ] dwhd is currently running. Its hostname is nsednev-he-4.scl.lab.tlv.redhat.com. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running
[ INFO ] Stage: Clean up
         Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20161219145602-4qn6be.log
[ INFO ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20161219145658-setup.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Execution of setup failed

# systemctl status ovirt-engine-dwhd
● ovirt-engine-dwhd.service - oVirt Engine Data Warehouse
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine-dwhd.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2016-12-19 14:56:58 IST; 2min 47s ago
 Main PID: 2359 (code=exited, status=0/SUCCESS)

Dec 19 13:57:56 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Starting oVirt Engine Data Warehouse...
Dec 19 13:57:56 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Started oVirt Engine Data Warehouse.
Dec 19 14:56:55 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Stopping oVirt Engine Data Warehouse...
Dec 19 14:56:58 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Stopped oVirt Engine Data Warehouse.

[root@nsednev-he-4 ~]# systemctl restart ovirt-engine-dwhd
[root@nsednev-he-4 ~]# systemctl status ovirt-engine-dwhd
● ovirt-engine-dwhd.service - oVirt Engine Data Warehouse
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine-dwhd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-12-19 15:00:00 IST; 6s ago
 Main PID: 27431 (ovirt-engine-dw)
   CGroup: /system.slice/ovirt-engine-dwhd.service
           ├─27431 /usr/bin/python /usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.py --re...
           └─27522 ovirt-engine-dwhd -Dorg.ovirt.engine.dwh.settings=/tmp/tmpntZi7P/settings.properties -classpath ...

Dec 19 14:59:59 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Starting oVirt Engine Data Warehouse...
Dec 19 15:00:00 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Started oVirt Engine Data Warehouse.

# engine-setup
.
.
.
.
[ INFO ] Execution of setup completed successfully

(Originally by Nikolai Sednev)
The postgresql-server upgrade started at 19:03:49 and caused postgres to shut down at 19:04:32, which caused connection errors for the engine and dwh at that time.

engine-setup was run at 19:26:53, the command for dwh to stop was issued at 19:27:09, and the error appeared at 19:27:11, just 2 seconds later. Not sure that is enough. I see in the dwh log that the stop action was received at 19:27:10 and it started, but the log stops there.

Nikolai, does the dwh log have more entries after 19:27:10?

Didi, perhaps we need to increase the time given to dwh to stop, to allow it to close properly.

(Originally by Shirly Radco)
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP. (Originally by rule-engine)
I believe that after the postgres restart dwh did reconnect properly, and the problem is that it did not have enough time to close properly during engine-setup. (Originally by Shirly Radco)
Didi, Please check why the engine-setup only waited for 2 seconds for dwh to stop. I remember it should be 4 seconds. (Originally by Shirly Radco)
(In reply to Shirly Radco from comment #17)
> Didi, Please check why the engine-setup only waited for 2 seconds for dwh
> to stop. I remember it should be 4 seconds.

engine-setup does not wait at all after stopping any service; that's not its job. It's up to each service to make sure that the 'stop' command, in this case 'systemctl stop ovirt-engine-dwhd', returns only after the service has fully stopped.

If you suspect that the problem is in the stop process (and not in the reconnect to pg), you can also try debugging this using OVIRT_SERVICE_DEBUG, see bz 1230165.

(Originally by didi)
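For illustration only, a minimal Python sketch of what "returns only after the service has fully stopped" means for a wrapper such as ovirt-engine-dwhd.py that spawns a child process. This is a generic pattern under assumed behavior, not the actual oVirt service code; the function name and grace period are made up for the example.

    import os
    import signal
    import time

    # Generic sketch, not the real ovirt-engine-dwhd.py: a stop handler
    # should return only once the child it spawned has actually exited;
    # otherwise 'systemctl stop' reports success while the child is still
    # shutting down.
    def stop_child(child_pid, grace=60):
        os.kill(child_pid, signal.SIGTERM)
        deadline = time.monotonic() + grace
        while time.monotonic() < deadline:
            pid, status = os.waitpid(child_pid, os.WNOHANG)
            if pid == child_pid:  # child really exited; safe to return
                return status
            time.sleep(0.5)
        # Last resort after the grace period expires.
        os.kill(child_pid, signal.SIGKILL)
        return os.waitpid(child_pid, 0)[1]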
Nikolai, Please try running this as Didi suggests, using OVIRT_SERVICE_DEBUG (Originally by Shirly Radco)
(In reply to Shirly Radco from comment #19)
> Nikolai, Please try running this as Didi suggests, using OVIRT_SERVICE_DEBUG

I will try to reproduce once again, and if successful, will attach the additional required information.

(Originally by Nikolai Sednev)
(In reply to Nikolai Sednev from comment #20)
> (In reply to Shirly Radco from comment #19)
> > Nikolai, Please try running this as Didi suggests, using OVIRT_SERVICE_DEBUG
>
> I will try to reproduce once again, and if successful, will attach the
> additional required information.

Hi Nikolai, I'm closing this bug for the time being. If you manage to reproduce, please re-open with additional details. I'm quite sure we are hit by this from time to time (I vaguely remember community reports), but we do not have a root cause yet, so it'd be great if it can be reproduced with more information.

(Originally by Yaniv Kaul)
I'm reopening, since I believe this happened because dwh did not manage to stop in time and the engine just killed it. It waited for less than 2 seconds.

In remote installations we have a mechanism that retries the stop a few times before reporting that the service was unable to stop. We need this in local installations as well, so that dwh manages to finish and update the engine database.

(Originally by Shirly Radco)
I'd like to clarify. In remote installations, we have two machines: engine on machine A and dwh on machine B.

1. When running engine-setup on A, to update the engine, it notifies the remote dwhd to stop by writing a value to its db. The remote dwhd checks this value, and when it's set, it exits, also setting another value, so that engine-setup on A knows that it exited (a rough sketch of this handshake follows after this comment).

2. When running engine-setup on machine B, to update dwh, it simply stops dwhd.

3. When both are on the same machine, the dwh plugin of engine-setup behaves the same: it just stops dwhd, as in (2). So when the engine plugin of engine-setup runs later, similarly to (1), dwhd was already stopped. If dwhd did not update the database that it exited, for any reason, we see the behaviour of the current bug (and similar ones): engine-setup loops for a minute and then gives up.

Generally speaking, the behaviour of (2) is reasonable: every user expects to be able to do 'service SOME_SERVICE stop' and have it cleanly stopped. If we want to change the behaviour, and make the dwh plugin of engine-setup do the same as the engine plugin does (ask dwhd to exit and wait until it does), we can, but:

A. This is not guaranteed to solve every case. If the problem is that dwhd failed to reconnect to the db, a minute might not be enough, and we did see cases where more than a minute was not enough.

B. This will break if the user manually stopped dwhd.

In addition, we still do not have a 100% reliable reproduction flow, so we can't know whether the bug is solved.

(Originally by didi)
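A hedged Python sketch of the handshake in (1), as seen from the engine-setup side. The dwh_history_timekeeping table is the real bookkeeping table mentioned later in this bug, but the variable names, the psycopg2 driver, and the timeout are illustrative assumptions, not the actual engine-setup code.

    import time
    import psycopg2  # assumed driver; the engine db is PostgreSQL

    STOP_REQUEST_VAR = 'DisconnectDwh'        # assumed variable name
    RUNNING_FLAG_VAR = 'DwhCurrentlyRunning'  # assumed variable name

    def ask_dwhd_to_stop(conn, timeout=60, interval=1):
        """Ask the remote dwhd to exit via the engine db, then poll until
        it confirms. Returns False in the 'loops for a minute then gives
        up' case described above."""
        with conn.cursor() as cur:
            # Set the flag that tells dwhd to exit.
            cur.execute(
                "UPDATE dwh_history_timekeeping SET var_value = '1' "
                "WHERE var_name = %s",
                (STOP_REQUEST_VAR,),
            )
        conn.commit()
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT var_value FROM dwh_history_timekeeping "
                    "WHERE var_name = %s",
                    (RUNNING_FLAG_VAR,),
                )
                row = cur.fetchone()
            if row and row[0] == '0':  # dwhd marked itself as stopped
                return True
            time.sleep(interval)
        return False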
Shirly - what do we want to do with this bug? I don't mind trying to fix - but what exactly? Is this a blocker to 4.1 GA? (Originally by Yaniv Kaul)
This is not a blocker for the GA, AFAIU. It does not happen on every upgrade/installation, only in specific cases, and we do have a workaround.

I think that in order to avoid this bug in the future we should use the same mechanism as in the remote installation, and not rely on the time that 'service SOME_SERVICE stop' reports. The parent process reports that the service has stopped even though the child dwh daemon is still closing its threads. 'service SOME_SERVICE stop' does stop the process cleanly, but the bug is that the time at which it reports the stop is unrelated to when the child process really stopped.

(Originally by Shirly Radco)
Discussed this with Shirly in private a few days ago. We agreed to try to change dwhd as follows:

1. In the places that update the engine db before exiting (dwh_history_timekeeping and audit_log), try to reconnect to the engine db prior to trying to update it.

2. Add more logging: before and after (re)connecting to the engine db, etc.

Also restoring the needinfo of comment 19. Nikolai, if you have a reproducer, please try it with OVIRT_SERVICE_DEBUG (and make sure it works with dwhd before actually trying to reproduce). Thanks.

(Originally by didi)
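A minimal Python sketch of the shape of change (1), assuming a psycopg2-style connection and a hypothetical reconnect_engine_db() helper. The real dwhd is a Java (Talend) job, so this only illustrates the reconnect-before-update idea plus the extra logging from (2), not actual dwhd code.

    import logging

    log = logging.getLogger('ovirt-engine-dwhd')

    def mark_dwhd_stopped(conn, reconnect_engine_db):
        """On exit, update the bookkeeping table, reconnecting first if
        the connection was lost (e.g. across a postgres restart).
        reconnect_engine_db is a hypothetical callable returning a fresh
        connection."""
        if conn is None or conn.closed:
            log.debug('engine db connection lost; reconnecting before update')
            conn = reconnect_engine_db()
            log.debug('reconnected to engine db')
        with conn.cursor() as cur:
            # Assumed for illustration: clear the flag that engine-setup
            # checks when it reports "dwhd is currently running".
            cur.execute(
                "UPDATE dwh_history_timekeeping SET var_value = '0' "
                "WHERE var_name = 'DwhCurrentlyRunning'"
            )
        conn.commit()
        log.debug('marked dwhd as stopped in dwh_history_timekeeping')
        return conn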
(In reply to Yedidyah Bar David from comment #26)
> Discussed this with Shirly in private a few days ago. We agreed to try to
> change dwhd as follows:
>
> 1. In the places that update the engine db before exiting
> (dwh_history_timekeeping and audit_log), try to reconnect to the engine db
> prior to trying to update it.
>
> 2. Add more logging: before and after (re)connecting to the engine db, etc.
>
> Also restoring the needinfo of comment 19. Nikolai, if you have a
> reproducer, please try it with OVIRT_SERVICE_DEBUG (and make sure it works
> with dwhd before actually trying to reproduce). Thanks.

Not reproducible any more.

(Originally by Nikolai Sednev)
I was not able to reproduce the issue in the latest build. If you encounter this issue, feel free to reopen. Package version: ovirt-engine-dwh-setup 4.0.7-2.el7ev.noarch
If this bug requires doc text for errata release, please provide draft text in the doc text field in the following format:

Cause:
Consequence:
Fix:
Result:

The documentation team will review, edit, and approve the text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0540.html