Description of problem:

[ ERROR ] dwhd is currently running. Its hostname is FQDNofmyengine. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running

Version-Release number of selected component (if applicable):

Engine:
ovirt-engine-4.1.0-0.0.master.20161201071307.gita5ff876.el7.centos.noarch
Linux version 3.10.0-327.36.3.el7.x86_64 (builder.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Oct 24 16:09:20 UTC 2016
Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.3.1611 (Core)

Host:
ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch
vdsm-4.18.999-1020.git1ff41b1.el7.centos.x86_64
ovirt-host-deploy-1.6.0-0.0.master.20161107121647.gitfd7ddcd.el7.centos.noarch
libvirt-client-2.0.0-10.el7_3.2.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-release41-pre-4.1.0-0.0.beta.20161201085255.git731841c.el7.centos.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64
ovirt-hosted-engine-ha-2.1.0-0.0.master.20161130135331.20161130135328.git3541725.el7.centos.noarch
ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-setup-lib-1.1.0-0.0.master.20161107100014.gitb73abeb.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.8-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0-0.0.master.20161130101611.gitb3ad261.el7.centos.noarch
ovirt-imageio-daemon-0.5.0-0.201611201242.gitb02532b.el7.centos.noarch
Linux version 3.10.0-514.2.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Nov 16 13:15:13 EST 2016
Linux 3.10.0-514.2.2.el7.x86_64 #1 SMP Wed Nov 16 13:15:13 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over NFS on a RHEL 7.3 host from ovirt-engine-appliance-4.1-20161202.1.el7.centos.noarch.
2. Once deployed, set the host to global maintenance and then, on the engine, run "yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release41-pre.rpm -y && yum clean all && yum update -y --skip-broken".
3. Once finished, run engine-setup on the engine.

Actual results:
engine-setup fails with:
[ ERROR ] dwhd is currently running. Its hostname is FQDNofmyengine. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running

Expected results:
engine-setup should pass without errors.

Additional info:
Created attachment 1231822 [details] boot.log
Created attachment 1231823 [details] engine.log
Created attachment 1231825 [details] ovirt-engine-dwhd.log
Created attachment 1231826 [details] ovirt-engine-setup-20161214183600-ozoyih.log
Created attachment 1231827 [details] ovirt-engine-setup-20161214192609-q08i8b.log
Created attachment 1231828 [details] server.log
Please attach postgresql and yum logs. Shirly - can you please have a look? Thanks.
Created attachment 1232051 [details] yum.log
Created attachment 1232052 [details] postgresql-Wed.log
Same issue, from 4.0.6 here: https://bugzilla.redhat.com/show_bug.cgi?id=1397005.
Restoring needinfo on Shirly
Reproduced again on a clean deployment. I saw that the dwhd service was not running and engine-setup failed with the same error. The workaround is to restart the dwh service.

[ ERROR ] dwhd is currently running. Its hostname is nsednev-he-4.scl.lab.tlv.redhat.com. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running
[ INFO ] Stage: Clean up
         Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20161219145602-4qn6be.log
[ INFO ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20161219145658-setup.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Execution of setup failed

# systemctl status ovirt-engine-dwhd
● ovirt-engine-dwhd.service - oVirt Engine Data Warehouse
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine-dwhd.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2016-12-19 14:56:58 IST; 2min 47s ago
 Main PID: 2359 (code=exited, status=0/SUCCESS)

Dec 19 13:57:56 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Starting oVirt Engine Data Warehouse...
Dec 19 13:57:56 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Started oVirt Engine Data Warehouse.
Dec 19 14:56:55 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Stopping oVirt Engine Data Warehouse...
Dec 19 14:56:58 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Stopped oVirt Engine Data Warehouse.

[root@nsednev-he-4 ~]# systemctl restart ovirt-engine-dwhd
[root@nsednev-he-4 ~]# systemctl status ovirt-engine-dwhd
● ovirt-engine-dwhd.service - oVirt Engine Data Warehouse
   Loaded: loaded (/usr/lib/systemd/system/ovirt-engine-dwhd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-12-19 15:00:00 IST; 6s ago
 Main PID: 27431 (ovirt-engine-dw)
   CGroup: /system.slice/ovirt-engine-dwhd.service
           ├─27431 /usr/bin/python /usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.py --re...
           └─27522 ovirt-engine-dwhd -Dorg.ovirt.engine.dwh.settings=/tmp/tmpntZi7P/settings.properties -classpath ...

Dec 19 14:59:59 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Starting oVirt Engine Data Warehouse...
Dec 19 15:00:00 nsednev-he-4.scl.lab.tlv.redhat.com systemd[1]: Started oVirt Engine Data Warehouse.

# engine-setup
.
.
.
.
[ INFO ] Execution of setup completed successfully
The postgresql-server upgrade started at 19:03:49, which caused postgres to shut down at 19:04:32 and produced connection errors for the engine and dwh at that time. engine-setup was run at 19:26:53, the command for dwh to stop was issued at 19:27:09, and the error appeared at 19:27:11, only 2 seconds later. Not sure that is enough. I see in the dwh log that the stop action was received at 19:27:10 and it started, but the log stops there.

Nikolai, does the dwh log have more entries after 19:27:10?

Didi, perhaps we need to increase the time allowed for dwh to stop, so it can close properly.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
I believe that after the postgres restart dwh did reconnect properly, and the problem is that it did not have enough time to close properly during engine-setup.
Didi, Please check why the engine-setup only waited for 2 seconds for dwh to stop. I remember it should be 4 seconds.
(In reply to Shirly Radco from comment #17)
> Didi, Please check why the engine-setup only waited for 2 seconds for dwh to
> stop. I remember it should be 4 seconds.

engine-setup does not wait at all after stopping any service; that is not its job. It is up to each service to make sure that the 'stop' command, in this case 'systemctl stop ovirt-engine-dwhd', returns only after the service has fully stopped. If you suspect that the problem is in the stop process (and not in the reconnect to pg), you can also try debugging this using OVIRT_SERVICE_DEBUG, see bz 1230165.
Nikolai, Please try running this as Didi suggests, using OVIRT_SERVICE_DEBUG
(In reply to Shirly Radco from comment #19)
> Nikolai, Please try running this as Didi suggests, using OVIRT_SERVICE_DEBUG

I will try to reproduce once again, and if successful, will attach the additional required information.
(In reply to Nikolai Sednev from comment #20)
> (In reply to Shirly Radco from comment #19)
> > Nikolai, Please try running this as Didi suggests, using OVIRT_SERVICE_DEBUG
>
> I will try to reproduce once again, and if successful, will attach the
> additional required information.

Hi Nikolai,

I'm closing this bug for the time being. If you manage to reproduce, please re-open with additional details. I'm quite sure we are hit by this from time to time (I kind of remember community reports) but we do not have a root cause yet, so it would be great if it can be reproduced with more information.
I'm reopening, since I believe this is due to dwh not managing to stop in time, and the engine just killed it. It waited for less than 2 seconds. In remote installations we have a mechanism that retries stopping it a few times before reporting that the service was unable to stop. We need to have this in local installations as well, so dwh manages to finish and update the engine database.
I'd like to clarify. In remote installations we have two machines: the engine on machine A and dwh on machine B.

1. When running engine-setup on A, to update the engine, it notifies the remote dwhd to stop by writing a value to its db. The remote dwhd checks this value and, when it is set, exits, also setting another value so that engine-setup on A knows that it exited.

2. When running engine-setup on machine B, to update dwh, it simply stops dwhd.

3. When both are on the same machine, the dwh plugin of engine-setup behaves the same - it just stops dwhd, as in (2.). So when the engine plugin of engine-setup runs later, similarly to (1.), dwhd has already been stopped. If dwhd did not update the database that it exited, for any reason, we see the behaviour of the current bug (and similar ones): engine-setup loops for a minute and then gives up. A rough sketch of this handshake is at the end of this comment.

Generally speaking, the behaviour of (2.) is reasonable - every user expects to be able to do 'service SOME_SERVICE stop' and have it cleanly stopped. If we want to change the behaviour, and make the dwh plugin of engine-setup do the same as the engine plugin does - ask dwhd to exit and wait until it does - we can, but:

A. This is not guaranteed to solve every case. If the problem is that dwhd failed to reconnect to the db, a minute might not be enough, and we did see cases where more than a minute was not enough.

B. This will break if the user manually stopped dwhd.

In addition, we still do not have a 100% successful reproduction flow, so we can't know if the bug is solved.
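Purely for illustration, here is a minimal Python sketch of the handshake described in (1.) above: engine-setup flags the engine db to ask dwhd to stop, then polls for up to a minute for dwhd to report that it exited. The table and variable names (dwh_history_timekeeping, DisconnectDwh, DwhCurrentlyRunning), the connection parameters and the timing constants are assumptions for illustration, not the real engine-setup or dwhd code.

```python
# Illustrative sketch only -- table/variable names, connection parameters
# and timeouts are assumptions, not the real engine-setup or dwhd code.
import time
import psycopg2

POLL_INTERVAL = 1   # seconds between checks
TIMEOUT = 60        # give up after roughly a minute, as described above


def request_dwhd_stop(conn):
    """engine-setup side: ask the (possibly remote) dwhd to exit via the engine db."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE dwh_history_timekeeping SET var_value = %s WHERE var_name = %s",
            ("1", "DisconnectDwh"),        # assumed flag name
        )
    conn.commit()


def wait_for_dwhd_exit(conn):
    """Poll the value dwhd is expected to update on clean exit."""
    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT var_value FROM dwh_history_timekeeping WHERE var_name = %s",
                ("DwhCurrentlyRunning",),  # assumed flag name
            )
            row = cur.fetchone()
        if row and row[0] == "0":
            return True                    # dwhd updated the db before exiting
        time.sleep(POLL_INTERVAL)
    return False                           # the situation in this bug: loop for a minute, give up


# Hypothetical usage:
# conn = psycopg2.connect(dbname="engine", user="engine", host="localhost")
# request_dwhd_stop(conn)
# if not wait_for_dwhd_exit(conn):
#     raise RuntimeError("dwhd is currently running")
```

If dwhd never writes the "I exited" value (for example because it could not reconnect to postgres after the restart), the polling loop above simply times out, which is exactly the symptom reported here.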
Shirly - what do we want to do with this bug? I don't mind trying to fix - but what exactly? Is this a blocker to 4.1 GA?
This is not a blocker for the GA, AFAIU. It does not happen on every upgrade/installation, only in specific cases, and we do have a workaround. I think that in order to avoid this bug in the future we should use the same mechanism as in the remote installation and not rely on when 'service SOME_SERVICE stop' reports that the service stopped. The parent process reports that the service has stopped even though the child dwh daemon is still closing its threads. 'service SOME_SERVICE stop' does stop the process cleanly, but the bug is that the time it reports it stopped is unrelated to when the child process really stopped.
Discussed this with Shirly in private a few days ago.

We agreed to try and change dwhd as follows (a rough sketch is below):
1. In the places that update the engine db before exiting (dwh_history_timekeeping and audit_log), try to reconnect to the engine db prior to trying to update it.
2. Add more logging - before and after (re)connecting to the engine db, etc.

Restoring also the needinfo of comment 19. Nikolai - if you have a reproducer, please try it with OVIRT_SERVICE_DEBUG (and make sure it works with dwhd prior to actually trying to reproduce). Thanks.
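As an illustration only of the proposed change (not the actual dwhd code), a hedged Python sketch of "reconnect to the engine db before updating it on shutdown, with logging around each step". The connection parameters, table and variable names are assumptions.

```python
# Illustrative sketch of "reconnect to the engine db before updating it on
# shutdown, and log each step". Names and connection details are assumptions.
import logging
import time
import psycopg2

logger = logging.getLogger("ovirt-engine-dwhd")


def connect_engine_db(retries=3, delay=5):
    """(Re)connect to the engine database, logging every attempt."""
    last_error = None
    for attempt in range(1, retries + 1):
        logger.info("Connecting to engine db (attempt %d/%d)", attempt, retries)
        try:
            conn = psycopg2.connect(dbname="engine", user="engine", host="localhost")
            logger.info("Connected to engine db")
            return conn
        except psycopg2.OperationalError as exc:
            last_error = exc
            logger.warning("Connection attempt %d failed: %s", attempt, exc)
            time.sleep(delay)
    raise last_error


def mark_dwhd_stopped():
    """On exit, record in the engine db that dwhd is no longer running."""
    conn = connect_engine_db()
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE dwh_history_timekeeping SET var_value = %s WHERE var_name = %s",
                ("0", "DwhCurrentlyRunning"),   # assumed flag name
            )
        conn.commit()
        logger.info("Marked dwhd as stopped in the engine db")
    finally:
        conn.close()
```

The point of the design is that a stale connection left over from the postgres restart is never trusted: the update that tells engine-setup "dwhd exited" is always attempted on a fresh connection, and every failure is logged so a future occurrence leaves evidence in ovirt-engine-dwhd.log.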
(In reply to Yedidyah Bar David from comment #26)
> Discussed this with Shirly in private a few days ago.
>
> We agreed to try and change dwhd as follows:
> 1. In the places that update the engine db before exiting
> (dwh_history_timekeeping and audit_log), try to reconnect to engine db prior
> to trying to update it.
> 2. Add more logging - before and after (re)connecting to engine db, etc.
>
> Restoring also needinfo of comment 19. Nikolai - if you have a reproducer,
> please try it with OVIRT_SERVICE_DEBUG (and make sure it works with dwhd
> prior to actually trying to reproduce). Thanks.

Not being reproduced any more.
I was not able to reproduce this issue with:
ovirt-engine-dwh-setup-4.1.1-1.el7ev.noarch
ovirt-engine-dwh-4.1.1-1.el7ev.noarch

Any chance you were able to reproduce this after the patch? If not, we can move this to VERIFIED, as we don't have proper verification steps, and reopen in case this issue reappears.
I have not seen this being reproduced for a while.
Moving to VERIFIED based on the previous comments.
This bug happened to me personally several times. The "reproducer" was:

1. Set up engine+dwh
2. Wait $RANDOM1
3. Restart dwhd
4. Wait $RANDOM2
5. Restart postgresql
6. engine-setup --offline

I did not spend enough time to pin down accurate, concrete RANDOM1 and RANDOM2 values that trigger it. A rough automation of this flow is sketched below.
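For what it's worth, a rough Python automation of the reproducer above. The sleep bounds for RANDOM1/RANDOM2 are arbitrary guesses (not values known to trigger the race), step 1 is assumed to be done already, and the postgresql service name may differ on a given setup.

```python
# Rough automation of the reproducer above. The RANDOM1/RANDOM2 bounds are
# arbitrary guesses; step 1 (engine+dwh setup) is assumed to be done already.
import random
import subprocess
import time


def run(cmd):
    """Print and run a command, failing loudly if it returns non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 2. Wait $RANDOM1 (bounds are arbitrary)
time.sleep(random.randint(10, 600))

# 3. Restart dwhd
run(["systemctl", "restart", "ovirt-engine-dwhd"])

# 4. Wait $RANDOM2 (bounds are arbitrary)
time.sleep(random.randint(10, 600))

# 5. Restart postgresql (service name assumed to be "postgresql")
run(["systemctl", "restart", "postgresql"])

# 6. Run engine-setup offline and see whether it hits the dwhd error
run(["engine-setup", "--offline"])
```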
*** Bug 1447347 has been marked as a duplicate of this bug. ***