Bug 1507037 - Race condition on starting DWH on fresh install.
Summary: Race condition on starting DWH on fresh install.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine-dwh
Classification: oVirt
Component: General
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.3.0
Assignee: Ido Rosenzwig
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks: 1631202
 
Reported: 2017-10-27 13:44 UTC by Sandro Bonazzola
Modified: 2019-02-13 07:45 UTC (History)
4 users

Fixed In Version: ovirt-engine-dwh-4.3.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-13 07:45:01 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.3+


Attachments (Terms of Use)
setup logs (221.24 KB, application/x-xz)
2017-10-27 13:45 UTC, Sandro Bonazzola
no flags


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 85101 0 'None' 'ABANDONED' 'setup: plugins: start dwh service after ovirt-engine service' 2019-11-18 03:09:50 UTC
oVirt gerrit 87716 0 'None' 'ABANDONED' 'history: wait 60 seconds before sampling' 2019-11-18 03:09:50 UTC
oVirt gerrit 88173 0 'None' 'ABANDONED' 'setup: plugins: start dwh service after ovirt-engine service' 2019-11-18 03:09:50 UTC
oVirt gerrit 88453 0 'None' 'MERGED' 'history: wait longer before first sampling' 2019-11-18 03:09:50 UTC

Description Sandro Bonazzola 2017-10-27 13:44:07 UTC
Description of problem:
Just installed a fresh engine on a fresh CentOS host using ovirt-release-master.
After engine-setup, I connected to the admin interface and it shows 1 error:
"ETL service sampling has encountered an error. Please consult the service log for more details."

I assume "service log" refers to "/var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log"; the phrasing here can be improved, so please consider opening a bug for it.

The log contains:
2017-10-27 15:16:58|ETL Service Started
ovirtEngineDbDriverClass|org.postgresql.Driver
ovirtEngineHistoryDbJdbcConnection|jdbc:postgresql://localhost:5432/ovirt_engine_history?sslfactory=org.postgresql.ssl.NonValidatingFactory
hoursToKeepDaily|0
hoursToKeepHourly|720
ovirtEngineDbPassword|**********************
runDeleteTime|3
ovirtEngineDbJdbcConnection|jdbc:postgresql://localhost:5432/engine?sslfactory=org.postgresql.ssl.NonValidatingFactory
runInterleave|60
limitRows|limit 1000
ovirtEngineHistoryDbUser|ovirt_engine_history
ovirtEngineDbUser|engine
deleteIncrement|10
timeBetweenErrorEvents|300000
hoursToKeepSamples|24
deleteMultiplier|1000
lastErrorSent|2011-07-03 12:46:47.000000
etlVersion|4.2.1
dwhAggregationDebug|false
dwhUuid|0343f512-ed12-4882-8369-37fd3ab1ef97
ovirtEngineHistoryDbDriverClass|org.postgresql.Driver
ovirtEngineHistoryDbPassword|**********************
2017-10-27 15:17:00|rtV5Cj|tHpOYa|dpjN55|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704

The error is reported at 15:17:00 while the first line in engine.log is at 15:17:12:
2017-10-27 15:17:12,860+02 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 48) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

And server.log first log is at 15:17:01
2017-10-27 15:17:01,187+02 INFO  [org.jboss.remoting] (MSC service thread 1-1) JBoss Remoting version 5.0.5.Final

So it looks like DWH is started at least 12 seconds before the engine starts producing data.

We should wait for the engine to be up before starting DWH.
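
In systemd terms, that ordering could be expressed with a drop-in for the dwh unit, for example (a sketch only; the unit names ovirt-engine.service and ovirt-engine-dwhd.service are assumed, and After= only orders startup, it does not guarantee the engine is already producing data):

# Hypothetical drop-in, e.g. /etc/systemd/system/ovirt-engine-dwhd.service.d/50-after-engine.conf
[Unit]
# Do not start dwhd before ovirt-engine has been started.
After=ovirt-engine.service
Wants=ovirt-engine.service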

Version-Release number of selected component (if applicable):
# rpm -qa |grep ovirt-engine
ovirt-engine-setup-plugin-websocket-proxy-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-dwh-4.2.1-0.2.master.20171026133424.el7.centos.noarch
ovirt-engine-setup-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-restapi-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-backend-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.2-0.1.20161204.gite99bbd1.el7.centos.noarch
ovirt-engine-lib-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-setup-base-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64
ovirt-engine-cli-3.6.9.3-0.1.20170116.gitad0561a.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-tools-backup-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-websocket-proxy-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-dwh-setup-4.2.1-0.2.master.20171026133424.el7.centos.noarch
ovirt-engine-webadmin-portal-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-vmconsole-proxy-helper-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-tools-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-api-explorer-0.0.1-0.0.20171013git1fbdc17.el7.centos.noarch
ovirt-engine-dbscripts-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-extensions-api-impl-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-wildfly-overlay-11.0.1-1.el7.centos.noarch
ovirt-engine-dashboard-1.2.0-0.5.20171012git78f7a28.el7.centos.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-metrics-1.1.1-0.0.master.20171019113406.el7.centos.noarch
ovirt-engine-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-extension-aaa-jdbc-1.1.7-0.0.master.git1660a27.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.1-1.a2.20171023gita89e6a9.el7.centos.x86_64


How reproducible:
I tried only once

Steps to Reproduce:
1. Install and setup oVirt Engine with DWH
2. Login to administration interface
3. See the error

Actual results:
Error is displayed after a clean setup

Expected results:
No errors after setup

Additional info:
See attached setup log for the sequence used by setup.

Comment 1 Sandro Bonazzola 2017-10-27 13:45:05 UTC
Created attachment 1344308 [details]
setup logs

Comment 2 Yedidyah Bar David 2017-12-05 13:21:19 UTC
(In reply to Sandro Bonazzola from comment #0)
> Description of problem:
> Just installed a fresh engine on a fresh CentOS  using ovirt-release-master.
> After engine-setup, connected to the admin interface, it shows 1 error:
> "ETL service sampling has encountered an error. Please consult the service
> log for more details."
> 
> I assume "service log" refers to
> "/var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log", the phrasing here can be
> improved, please consider opening a bug for it.

Agreed.

> 
> The log contains:
> 2017-10-27 15:16:58|ETL Service Started
> ovirtEngineDbDriverClass|org.postgresql.Driver
> ovirtEngineHistoryDbJdbcConnection|jdbc:postgresql://localhost:5432/
> ovirt_engine_history?sslfactory=org.postgresql.ssl.NonValidatingFactory
> hoursToKeepDaily|0
> hoursToKeepHourly|720
> ovirtEngineDbPassword|**********************
> runDeleteTime|3
> ovirtEngineDbJdbcConnection|jdbc:postgresql://localhost:5432/
> engine?sslfactory=org.postgresql.ssl.NonValidatingFactory
> runInterleave|60
> limitRows|limit 1000
> ovirtEngineHistoryDbUser|ovirt_engine_history
> ovirtEngineDbUser|engine
> deleteIncrement|10
> timeBetweenErrorEvents|300000
> hoursToKeepSamples|24
> deleteMultiplier|1000
> lastErrorSent|2011-07-03 12:46:47.000000
> etlVersion|4.2.1
> dwhAggregationDebug|false
> dwhUuid|0343f512-ed12-4882-8369-37fd3ab1ef97
> ovirtEngineHistoryDbDriverClass|org.postgresql.Driver
> ovirtEngineHistoryDbPassword|**********************
> 2017-10-27
> 15:17:
> 00|rtV5Cj|tHpOYa|dpjN55|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn
> |tWarn_1|Can not sample data, oVirt Engine is not updating the statistics.
> Please check your oVirt Engine status.|9704
> 
> The error is reported at 15:17:00 while the first line in engine.log is at
> 15:17:12:
> 2017-10-27 15:17:12,860+02 INFO 
> [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread
> Pool -- 48) [] Loaded file
> '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.
> 
> And server.log first log is at 15:17:01
> 2017-10-27 15:17:01,187+02 INFO  [org.jboss.remoting] (MSC service thread
> 1-1) JBoss Remoting version 5.0.5.Final
> 
> So, looks like DWH is started at least 12 seconds before the engine start
> producing data.
> 
> We should wait for the engine to be up before starting DWH.

I do not agree.

This would needlessly make engine-setup wait and take longer.

An alternative solution might be to make dwh emit such errors only after a certain amount of time since it was started - say, 1 minute or so.
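
In pseudo-python, such a startup grace period could look roughly like this (an illustrative sketch only, not the actual dwhd code; all names are made up):

import time

STARTUP_GRACE = 60  # seconds: suppress this particular alert right after startup
_service_start = time.monotonic()

def maybe_alert_no_engine_data(log):
    # Only emit the "engine is not updating the statistics" alert once the
    # service has been running for longer than the grace period.
    if time.monotonic() - _service_start >= STARTUP_GRACE:
        log("Can not sample data, oVirt Engine is not updating the statistics. "
            "Please check your oVirt Engine status.")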

Shirly, what do you think?

Comment 3 Shirly Radco 2018-01-11 08:47:08 UTC
I agree. Moving this to me. You can abandon the patch.

Comment 4 Shirly Radco 2018-02-15 10:02:03 UTC
After discussion with Yaniv Lavi, this needs to be handled during engine setup, since we do not want dwh to wait after every restart.

Moving this back to Ido.

Yaniv, please ack.

Comment 5 Yedidyah Bar David 2018-02-15 10:28:49 UTC
Don't we (potentially) have the same issue on system reboot, i.e. that dwh will start before the engine has started, or at least before the engine has data?

Even unrelated to timing, I do not think we want to emit false alarms at all. Doing so risks losing users' attention when a real problem happens.

If dwh checks every 60 seconds and just one check didn't see updates, I do not think we should alert users. I'd say we want something like this, in pseudo-python:

min_time_for_alerts = 200 # seconds
time_between_samples = 60 # seconds
last_good_sample_time = now()
while True:
    data = try_to_sample()
    if data:
        last_good_sample_time = now()
        process(data)
    else:
        if now() - last_good_sample_time > min_time_for_alerts:
            alert()
    sleep(time_between_samples)

That is, the alerts should not interfere with the sampling loop/logic. Also 200 does not need to be divisible by 60.

Comment 6 Yaniv Lavi 2018-02-26 13:20:17 UTC
(In reply to Yedidyah Bar David from comment #5)
> Don't we have (potentially) the same issue on system reboot? That dwh will
> start before the engine started, or at least before the engine has data?
> 
> Even unrelated to timing, I do not think we want to emit false alarms at
> all. Doing this risks loosing users' attention when a real problem happened.
> 
> If dwh checks every 60 seconds, and for one time didn't see updates, I do
> not think we should alert users. I'd say, we want something like this, in
> pseudo-python:
> 
> min_time_for_alerts = 200 # seconds
> time_between_samples = 60 # seconds
> last_good_sample_time = now()
> while true; do
>     try_to_sample
>     if found data:
>         last_good_sampe_time = now()
>         process data
>     else:
>         if now() - last_good_sample_time > min_time_for_alerts then:
>             alert
>     sleep(time_between_samples)
> 
> That is, the alerts should not interfere with the sampling loop/logic. Also
> 200 does not need to be divisible by 60.

We already have an anti-flood mechanism in DWH for when errors happen on every sample.
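
For reference, such throttling generally looks something like this (a sketch only, loosely based on the timeBetweenErrorEvents value of 300000 ms visible in the dwhd log above; not the actual DWH code):

import time

TIME_BETWEEN_ERROR_EVENTS_MS = 300_000  # matches the value in the config dump above
_last_error_sent_ms = float("-inf")

def report_error(log, message):
    # Forward an error event at most once per interval; repeats arriving
    # while the window is still open are dropped.
    global _last_error_sent_ms
    now_ms = time.monotonic() * 1000
    if now_ms - _last_error_sent_ms >= TIME_BETWEEN_ERROR_EVENTS_MS:
        _last_error_sent_ms = now_ms
        log(message)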

Comment 7 Yedidyah Bar David 2018-02-26 13:32:48 UTC
(In reply to Yaniv Lavi from comment #6)
> We have a anti-flood mechanism already in DWH if errors happen every sample.

Good, so should we also fine-tune it for the current case?

Comment 8 Yaniv Lavi 2018-02-26 13:40:45 UTC
(In reply to Yedidyah Bar David from comment #7)
> (In reply to Yaniv Lavi from comment #6)
> > We have a anti-flood mechanism already in DWH if errors happen every sample.
> 
> Good, so should we fine-tune it also for current case?

I'm not sure, since we don't want to delay every service start, only the start at boot, and we have no way to tell the difference.
We could lose debug info that we will need if we suppress errors on service start for 20 seconds, for example.

Comment 9 Yedidyah Bar David 2018-02-26 13:58:14 UTC
(In reply to Yaniv Lavi from comment #8)
> (In reply to Yedidyah Bar David from comment #7)
> > (In reply to Yaniv Lavi from comment #6)
> > > We have a anti-flood mechanism already in DWH if errors happen every sample.
> > 
> > Good, so should we fine-tune it also for current case?
> 
> I'm not sure since we don't want to block every service start, only on the
> boot and we have no way to know the difference.
> We could lose debug info we will need if we block error on service start for
> 20 seconds for example.

Not sure which debug info.

I am not suggesting blocking the logging of _errors_, only this specific alert "Can not sample data, oVirt Engine is not updating...". It is not accompanied by any other data, AFAICT. If it happens more than a few times, we'll log something eventually. If you want to know exactly when it started, you can do that with the alert in comment 5 - alert "did not get new data since $last_good_sample_time".
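
For example (illustrative only, not existing DWH code), an alert that carries the timestamp of the last successful sample:

from datetime import datetime

def alert_stale(last_good_sample_time: datetime, log) -> None:
    # Include when data was last seen, so users know exactly when it stopped flowing.
    log("Did not get new data since %s" % last_good_sample_time.isoformat(sep=" "))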

Comment 10 Lucie Leistnerova 2018-09-07 11:03:40 UTC
No ETL error after fresh install and reboot

verified in ovirt-engine-dwh-4.3.0-0.0.master.20180705143342.el7.noarch

Comment 11 Sandro Bonazzola 2018-11-02 14:32:52 UTC
This bugzilla is included in oVirt 4.2.7 release, published on November 2nd 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 12 Sandro Bonazzola 2018-11-02 14:50:31 UTC
Closed by mistake, moving back to qa -> verified

Comment 13 Sandro Bonazzola 2019-02-13 07:45:01 UTC
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

