Bug 1427146 - [downstream clone - 4.0.7] dwhd sometimes does not update the engine db before stopping
Summary: [downstream clone - 4.0.7] dwhd sometimes does not update the engine db befor...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-dwh
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.0.7
Target Release: ---
Assignee: Shirly Radco
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On: 1404812
Blocks:
 
Reported: 2017-02-27 13:08 UTC by rhev-integ
Modified: 2019-04-28 13:43 UTC
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1404812
Environment:
Last Closed: 2017-03-16 15:28:41 UTC
oVirt Team: Metrics
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0540 0 normal SHIPPED_LIVE rhevm-dwh 4.0.7 bug fix update 2017-03-16 19:24:52 UTC
oVirt gerrit 73119 0 master MERGED history: update dwh stop process 2020-07-08 09:33:42 UTC
oVirt gerrit 73149 0 ovirt-engine-dwh-4.1 MERGED history: update dwh stop process 2020-07-08 09:33:42 UTC
oVirt gerrit 73153 0 None MERGED history: update dwh stop process 2020-07-08 09:33:42 UTC

Description rhev-integ 2017-02-27 13:08:03 UTC
+++ This bug is an upstream to downstream clone. The original bug is: +++
+++   bug 1404812 +++
======================================================================

Description of problem:
[ ERROR ] dwhd is currently running. Its hostname is FQDNofmyengine. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running


Version-Release number of selected component (if applicable):
Engine:
ovirt-engine-4.1.0-0.0.master.20161201071307.gita5ff876.el7.centos.noarch
Linux version 3.10.0-327.36.3.el7.x86_64 (builder.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Oct 24 16:09:20 UTC 2016
Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.3.1611 (Core) 

Host:
ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch
vdsm-4.18.999-1020.git1ff41b1.el7.centos.x86_64
ovirt-host-deploy-1.6.0-0.0.master.20161107121647.gitfd7ddcd.el7.centos.noarch
libvirt-client-2.0.0-10.el7_3.2.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-release41-pre-4.1.0-0.0.beta.20161201085255.git731841c.el7.centos.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64
ovirt-hosted-engine-ha-2.1.0-0.0.master.20161130135331.20161130135328.git3541725.el7.centos.noarch
ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-setup-lib-1.1.0-0.0.master.20161107100014.gitb73abeb.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.8-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-0.0.master.20161130101611.gitb3ad261.el7.centos.noarch
ovirt-imageio-daemon-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
Linux version 3.10.0-514.2.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Nov 16 13:15:13 EST 2016
Linux 3.10.0-514.2.2.el7.x86_64 #1 SMP Wed Nov 16 13:15:13 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)


How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over NFS on a RHEL 7.3 host from ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch.
2. Once deployed, set the host to global maintenance and then on the engine run "yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release41-pre.rpm -y && yum clean all && yum update -y --skip-broken"
3. Once finished, run engine-setup on the engine.

Actual results:
engine-setup fails with "[ ERROR ] dwhd is currently running. Its hostname is FQDNofmyengine. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running"

Expected results:
engine-setup should pass without errors.

Additional info:

(Originally by Nikolai Sednev)

Comment 1 rhev-integ 2017-02-27 13:08:14 UTC
Created attachment 1231822 [details]
boot.log

(Originally by Nikolai Sednev)

Comment 3 rhev-integ 2017-02-27 13:08:20 UTC
Created attachment 1231823 [details]
engine.log

(Originally by Nikolai Sednev)

Comment 4 rhev-integ 2017-02-27 13:08:26 UTC
Created attachment 1231825 [details]
ovirt-engine-dwhd.log

(Originally by Nikolai Sednev)

Comment 5 rhev-integ 2017-02-27 13:08:33 UTC
Created attachment 1231826 [details]
ovirt-engine-setup-20161214183600-ozoyih.log

(Originally by Nikolai Sednev)

Comment 6 rhev-integ 2017-02-27 13:08:39 UTC
Created attachment 1231827 [details]
ovirt-engine-setup-20161214192609-q08i8b.log

(Originally by Nikolai Sednev)

Comment 7 rhev-integ 2017-02-27 13:08:45 UTC
Created attachment 1231828 [details]
server.log

(Originally by Nikolai Sednev)

Comment 8 rhev-integ 2017-02-27 13:08:52 UTC
Please attach postgresql and yum logs. Shirly - can you please have a look? Thanks.

(Originally by didi)

Comment 9 rhev-integ 2017-02-27 13:08:58 UTC
Created attachment 1232051 [details]
yum.log

(Originally by Nikolai Sednev)

Comment 10 rhev-integ 2017-02-27 13:09:04 UTC
Created attachment 1232052 [details]
postgresql-Wed.log

(Originally by Nikolai Sednev)

Comment 11 rhev-integ 2017-02-27 13:09:10 UTC
Same issue, from 4.0.6 here: https://bugzilla.redhat.com/show_bug.cgi?id=1397005.

(Originally by Nikolai Sednev)

Comment 12 rhev-integ 2017-02-27 13:09:17 UTC
Restoring needinfo on Shirly

(Originally by Sandro Bonazzola)

Comment 13 rhev-integ 2017-02-27 13:09:23 UTC
Reproduced again on a clean deployment.
I've seen that the dwhd service was not running, and engine-setup failed with the same error.

The workaround is to restart the dwh service.

[ ERROR ] dwhd is currently running. Its hostname is nsednev-he-4.scl.lab.tlv.redhat.com. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running
[ INFO  ] Stage: Clean up
          Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20161219145602-4qn6be.log
[ INFO  ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20161219145658-setup.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Execution of setup failed

# systemctl status ovirt-engine-dwhd
● ovirt-engine-dwhd.service - oVirt Engine Data Warehouse
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine-dwhd.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2016-12-19 14:56:58 IST; 2min 47s ago
 Main PID: 2359 (code=exited, status=0/SUCCESS)

Dec 19 13:57:56 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Starting oVirt Engine Data Warehouse...
Dec 19 13:57:56 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Started oVirt Engine Data Warehouse.
Dec 19 14:56:55 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Stopping oVirt Engine Data Warehouse...
Dec 19 14:56:58 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Stopped oVirt Engine Data Warehouse.
[root@nsednev-he-4 ~]# systemctl restart  ovirt-engine-dwhd
[root@nsednev-he-4 ~]# systemctl status ovirt-engine-dwhd
● ovirt-engine-dwhd.service - oVirt Engine Data Warehouse
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine-dwhd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-12-19 15:00:00 IST; 6s ago
 Main PID: 27431 (ovirt-engine-dw)
   CGroup: /system.slice/ovirt-engine-dwhd.service
           ├─27431 /usr/bin/python /usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.py --re...
           └─27522 ovirt-engine-dwhd -Dorg.ovirt.engine.dwh.settings=/tmp/tmpntZi7P/settings.properties -classpath ...

Dec 19 14:59:59 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Starting oVirt Engine Data Warehouse...
Dec 19 15:00:00 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Started oVirt Engine Data Warehouse.


# engine-setup
.
.
.
.
[ INFO  ] Execution of setup completed successfully

(Originally by Nikolai Sednev)

Comment 14 rhev-integ 2017-02-27 13:09:30 UTC
The postgresql-server upgrade started at 19:03:49. This caused postgres to shut down at 19:04:32, which caused connection errors for the engine and dwh at that time.

engine-setup was run at 19:26:53, the command for dwh to stop was run at 19:27:09, and the error was at 19:27:11, two seconds later.
Not sure if that is enough.
I see that in the dwh log the action was received at 19:27:10 and it started, but the log stops there.

Nikolai, does the dwh log have more entries after 19:27:10?

Didi, perhaps we need to increase the time for dwh to stop, to allow it to close properly.

(Originally by Shirly Radco)

Comment 15 rhev-integ 2017-02-27 13:09:37 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

(Originally by rule-engine)

Comment 16 rhev-integ 2017-02-27 13:09:43 UTC
I believe that after the postgres restart the dwh did reconnect properly and the problem is that it did not have enough time to properly close during the engine-setup.

(Originally by Shirly Radco)

Comment 18 rhev-integ 2017-02-27 13:09:55 UTC
Didi, Please check why the engine-setup only waited for 2 seconds for dwh to stop. I remember it should be 4 seconds.

(Originally by Shirly Radco)

Comment 19 rhev-integ 2017-02-27 13:10:01 UTC
(In reply to Shirly Radco from comment #17)
> Didi, Please check why the engine-setup only waited for 2 seconds for dwh to
> stop. I remember it should be 4 seconds.

engine-setup does not wait at all after stopping any service; that's not its job.

It's up to each service to make sure that the 'stop' command, in this case 'systemctl stop ovirt-engine-dwhd', returns only after the service has fully stopped.

If you suspect that it's something in the stop process (and not in reconnect to pg), you can try debugging this also using OVIRT_SERVICE_DEBUG, see bz 1230165.

(Originally by didi)
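The stop-semantics point above can be sketched with a toy Python model (all names here are illustrative, not the real dwhd code): if "stop" merely signals a worker and returns, the caller can observe the service as stopped before its final database update has happened; a correct stop joins the worker first.

```python
import threading
import time

class DwhLikeService:
    """Toy service whose worker thread needs time to flush state on stop."""

    def __init__(self):
        self._stop_requested = threading.Event()
        self.db_updated = False          # stands in for the "I have exited" record in the engine db
        self._worker = threading.Thread(target=self._run)
        self._worker.start()

    def _run(self):
        while not self._stop_requested.is_set():
            time.sleep(0.01)             # normal ETL work would happen here
        time.sleep(0.05)                 # simulate slow cleanup (closing connections, etc.)
        self.db_updated = True           # final state update, as dwhd does before exiting

    def stop_async(self):
        """Broken pattern: signal stop and return immediately."""
        self._stop_requested.set()

    def stop_sync(self):
        """Correct pattern: signal stop and wait for the worker to finish."""
        self._stop_requested.set()
        self._worker.join()

svc = DwhLikeService()
svc.stop_async()
print("after async stop, db updated?", svc.db_updated)   # almost certainly still False
svc._worker.join()
print("after joining the worker:", svc.db_updated)
```

This is the gap engine-setup runs into: a stop command that reports success before the cleanup path has written its final record.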

Comment 20 rhev-integ 2017-02-27 13:10:08 UTC
Nikolai, Please try running this as  Didi suggests, using OVIRT_SERVICE_DEBUG

(Originally by Shirly Radco)

Comment 21 rhev-integ 2017-02-27 13:10:14 UTC
(In reply to Shirly Radco from comment #19)
> Nikolai, Please try running this as  Didi suggests, using OVIRT_SERVICE_DEBUG

I will try to reproduce once again, and if successful, will attach additional required information.

(Originally by Nikolai Sednev)

Comment 22 rhev-integ 2017-02-27 13:10:20 UTC
(In reply to Nikolai Sednev from comment #20)
> (In reply to Shirly Radco from comment #19)
> > Nikolai, Please try running this as  Didi suggests, using OVIRT_SERVICE_DEBUG
> 
> I will try to reproduce once again, and if successful, will attach
> additional required information.

Hi Nikolai,

I'm closing this bug for the time being. If you manage to reproduce, please re-open with additional details. I'm quite sure we are hit by this from time to time (I kind of remember community reports) but do not have root cause yet, so it'd be great if it can be reproduced with more information.

(Originally by Yaniv Kaul)

Comment 23 rhev-integ 2017-02-27 13:10:27 UTC
I'm reopening, since I believe this is due to the fact that dwh did not manage to stop in time and the engine just killed it. It waited for less than 2 seconds.
In remote installations we have a mechanism that retries stopping it a few times before notifying that the service was unable to stop.
We need to have this in local installations as well, so it manages to finish and update the engine database.

(Originally by Shirly Radco)

Comment 24 rhev-integ 2017-02-27 13:10:33 UTC
I'd like to clarify. In remote installations, we have two machines - engine on machine A and dwh on machine B.

1. When running engine-setup on A, to update the engine, it notifies the remote dwhd to stop, by writing a value to its db. The remote dwhd is checking this value, and when it's set, it exits, also setting another value, so that engine-setup on A knows that it exited.

2. When running engine-setup on machine B, to update dwh, it simply stops dwhd.

3. When both are on the same machine, the code of engine-setup dwh plugin behaves the same - it just stops dwhd, as in (2.). So when later the code in engine plugin of engine-setup runs, similarly to (1.), dwhd was already stopped. If dwhd did not update the database that it exited, for any reason, we see the behaviour of current bug (and similar) - it loops for a minute then gives up.

Generally speaking, the behaviour of (2.) is reasonable - every user expects to be able to do 'service SOME_SERVICE stop' and have it cleanly stopped.

If we want to change the behaviour, and make the dwh plugin of engine-setup to do the same as the engine plugin does - ask it to exit and wait until it does - we can, but: A. This is not guaranteed to solve every case. If the problem is that dwhd failed to reconnect to the db, a minute might not be enough. And we did see cases where more than a minute was not enough. B. This will break if the user manually stopped dwhd.

In addition, we still do not have a 100% successful reproduction flow. So we can't know if the bug is solved.

(Originally by didi)
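The remote-stop handshake described in (1.) above can be sketched as follows. This is a minimal in-memory model, not the real implementation: the flag names and a dict standing in for the dwh_history_timekeeping table are illustrative assumptions.

```python
import threading
import time

# Toy stand-in for the shared engine db table (names hypothetical).
db = {"DwhCurrentlyRunning": "1"}

def engine_setup_request_stop(db, timeout=5.0, interval=0.1):
    """engine-setup side: write a stop request, then poll until dwhd
    clears its 'running' flag or the timeout expires."""
    db["stopRequested"] = "1"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if db["DwhCurrentlyRunning"] == "0":
            return True
        time.sleep(interval)
    return False                         # gave up, as in "it loops for a minute then gives up"

def dwhd_poll_once(db):
    """dwhd side, one main-loop iteration: check the stop flag and, if set,
    mark itself stopped before exiting."""
    if db.get("stopRequested") == "1":
        db["DwhCurrentlyRunning"] = "0"  # acknowledge, so engine-setup can proceed
        return False                     # stop the loop
    return True                          # keep running

# Demo: dwhd "wakes up" 0.2s later, sees the request, and acknowledges it.
threading.Timer(0.2, dwhd_poll_once, args=(db,)).start()
print("dwhd stopped in time?", engine_setup_request_stop(db, timeout=2.0))
```

The bug's failure mode corresponds to the acknowledgement write never happening, so the polling side times out.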

Comment 25 rhev-integ 2017-02-27 13:10:39 UTC
Shirly - what do we want to do with this bug?
I don't mind trying to fix - but what exactly?
Is this a blocker to 4.1 GA?

(Originally by Yaniv Kaul)

Comment 26 rhev-integ 2017-02-27 13:10:46 UTC
This is not a blocker for the GA AFAIU.
It does not happen on every upgrade/installation but only in specific cases, and we do have a workaround.

I think that in order to avoid this bug in the future we should use the same mechanism as in the remote installation and not rely on the time the 'service SOME_SERVICE stop' command reports.

The parent process notifies that the service has stopped even though the child dwh daemon is still closing its threads.

'service SOME_SERVICE stop' does stop the process cleanly, but the bug is that the time it reports it stopped is unrelated to when the child process really stopped.

(Originally by Shirly Radco)
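The parent/child reporting gap described above can be demonstrated with a small fork sketch (Unix-only, and purely illustrative of the race, not dwhd's actual launcher): the parent reports "stopped" immediately, while the child is still finishing its cleanup.

```python
import os
import tempfile
import time

# A marker file stands in for the child's final db update.
fd, marker_path = tempfile.mkstemp()
os.close(fd)

pid = os.fork()
if pid == 0:
    # Child: pretend to close threads/connections, then record completion.
    time.sleep(0.2)
    with open(marker_path, "w") as f:
        f.write("cleanup done")
    os._exit(0)

# Parent (the "service stop reported success" path): checks right away,
# without waiting for the child to finish.
with open(marker_path) as f:
    print("cleanup recorded at the moment stop was reported?", bool(f.read()))

os.waitpid(pid, 0)           # a correct stop would have waited here first
with open(marker_path) as f:
    final_state = f.read()
print("after waiting for the child:", final_state)
os.unlink(marker_path)
```

Anything that checks the database in the window between "stop reported" and "child actually finished" sees exactly the inconsistent state this bug describes.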

Comment 27 rhev-integ 2017-02-27 13:10:52 UTC
Discussed this with Shirly in private a few days ago.

We agreed to try and change dwhd as follows:
1. In the places that update the engine db before exiting (dwh_history_timekeeping and audit_log), try to reconnect to engine db prior to trying to update it.
2. Add more logging - before and after (re)connecting to engine db, etc.

Restoring also needinfo of comment 19. Nikolai - if you have a reproducer, please try it with OVIRT_SERVICE_DEBUG (and make sure it works with dwhd prior to actually trying to reproduce). Thanks.

(Originally by didi)

Comment 28 rhev-integ 2017-02-27 13:10:58 UTC
(In reply to Yedidyah Bar David from comment #26)
> Discussed this with Shirly in private a few days ago.
> 
> We agreed to try and change dwhd as follows:
> 1. In the places that update the engine db before exiting
> (dwh_history_timekeeping and audit_log), try to reconnect to engine db prior
> to trying to update it.
> 2. Add more logging - before and after (re)connecting to engine db, etc.
> 
> Restoring also needinfo of comment 19. Nikolai - if you have a reproducer,
> please try it with OVIRT_SERVICE_DEBUG (and make sure it works with dwhd
> prior to actually trying to reproduce). Thanks.

Not being reproduced any more.

(Originally by Nikolai Sednev)

Comment 32 Lukas Svaty 2017-03-02 18:09:06 UTC
Was not able to reproduce the issue in the latest build. If you encounter this issue feel free to reopen.

package version: 
ovirt-engine-dwh-setup 	4.0.7-2.el7ev.noarch

Comment 33 Byron Gravenorst 2017-03-06 00:25:58 UTC
If this bug requires doc text for errata release, please provide draft text in the doc text field in the following format:
Cause:
Consequence:
Fix:
Result:
The documentation team will review, edit, and approve the text.

Comment 35 errata-xmlrpc 2017-03-16 15:28:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0540.html

