Bug 1507037 - Race condition on starting DWH on fresh install.
Summary: Race condition on starting DWH on fresh install.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine-dwh
Classification: oVirt
Component: General
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.3.0
Assignee: Ido Rosenzwig
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks: 1631202
 
Reported: 2017-10-27 13:44 UTC by Sandro Bonazzola
Modified: 2019-02-13 07:45 UTC (History)
4 users

Fixed In Version: ovirt-engine-dwh-4.3.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-13 07:45:01 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.3+


Attachments (Terms of Use)
setup logs (221.24 KB, application/x-xz)
2017-10-27 13:45 UTC, Sandro Bonazzola
no flags


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 85101 0 'None' 'ABANDONED' 'setup: plugins: start dwh service after ovirt-engine service' 2019-11-18 03:09:50 UTC
oVirt gerrit 87716 0 'None' 'ABANDONED' 'history: wait 60 seconds before sampling' 2019-11-18 03:09:50 UTC
oVirt gerrit 88173 0 'None' 'ABANDONED' 'setup: plugins: start dwh service after ovirt-engine service' 2019-11-18 03:09:50 UTC
oVirt gerrit 88453 0 'None' 'MERGED' 'history: wait longer before first sampling' 2019-11-18 03:09:50 UTC

Description Sandro Bonazzola 2017-10-27 13:44:07 UTC
Description of problem:
Just installed a fresh engine on a fresh CentOS host using ovirt-release-master.
After engine-setup, I connected to the admin interface and it shows 1 error:
"ETL service sampling has encountered an error. Please consult the service log for more details."

I assume "service log" refers to "/var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log"; the phrasing here can be improved, so please consider opening a bug for it.

The log contains:
2017-10-27 15:16:58|ETL Service Started
ovirtEngineDbDriverClass|org.postgresql.Driver
ovirtEngineHistoryDbJdbcConnection|jdbc:postgresql://localhost:5432/ovirt_engine_history?sslfactory=org.postgresql.ssl.NonValidatingFactory
hoursToKeepDaily|0
hoursToKeepHourly|720
ovirtEngineDbPassword|**********************
runDeleteTime|3
ovirtEngineDbJdbcConnection|jdbc:postgresql://localhost:5432/engine?sslfactory=org.postgresql.ssl.NonValidatingFactory
runInterleave|60
limitRows|limit 1000
ovirtEngineHistoryDbUser|ovirt_engine_history
ovirtEngineDbUser|engine
deleteIncrement|10
timeBetweenErrorEvents|300000
hoursToKeepSamples|24
deleteMultiplier|1000
lastErrorSent|2011-07-03 12:46:47.000000
etlVersion|4.2.1
dwhAggregationDebug|false
dwhUuid|0343f512-ed12-4882-8369-37fd3ab1ef97
ovirtEngineHistoryDbDriverClass|org.postgresql.Driver
ovirtEngineHistoryDbPassword|**********************
2017-10-27 15:17:00|rtV5Cj|tHpOYa|dpjN55|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704

The error is reported at 15:17:00 while the first line in engine.log is at 15:17:12:
2017-10-27 15:17:12,860+02 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 48) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

And server.log first log is at 15:17:01
2017-10-27 15:17:01,187+02 INFO  [org.jboss.remoting] (MSC service thread 1-1) JBoss Remoting version 5.0.5.Final

So it looks like DWH is started at least 12 seconds before the engine starts producing data.

We should wait for the engine to be up before starting DWH.
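
In systemd terms, that ordering could be expressed with a drop-in for the dwh unit, for example (a sketch only; the unit names ovirt-engine.service and ovirt-engine-dwhd.service are assumed, and After= only orders startup, it does not guarantee the engine is already producing data):

# Hypothetical drop-in, e.g. /etc/systemd/system/ovirt-engine-dwhd.service.d/50-after-engine.conf
[Unit]
# Do not start dwhd before ovirt-engine has been started.
After=ovirt-engine.service
Wants=ovirt-engine.service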

Version-Release number of selected component (if applicable):
# rpm -qa |grep ovirt-engine
ovirt-engine-setup-plugin-websocket-proxy-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-dwh-4.2.1-0.2.master.20171026133424.el7.centos.noarch
ovirt-engine-setup-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-restapi-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-backend-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.2-0.1.20161204.gite99bbd1.el7.centos.noarch
ovirt-engine-lib-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-setup-base-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64
ovirt-engine-cli-3.6.9.3-0.1.20170116.gitad0561a.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-tools-backup-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-websocket-proxy-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-dwh-setup-4.2.1-0.2.master.20171026133424.el7.centos.noarch
ovirt-engine-webadmin-portal-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-vmconsole-proxy-helper-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-tools-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-api-explorer-0.0.1-0.0.20171013git1fbdc17.el7.centos.noarch
ovirt-engine-dbscripts-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-extensions-api-impl-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-wildfly-overlay-11.0.1-1.el7.centos.noarch
ovirt-engine-dashboard-1.2.0-0.5.20171012git78f7a28.el7.centos.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-metrics-1.1.1-0.0.master.20171019113406.el7.centos.noarch
ovirt-engine-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-extension-aaa-jdbc-1.1.7-0.0.master.git1660a27.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.1-1.a2.20171023gita89e6a9.el7.centos.x86_64


How reproducible:
I tried only once

Steps to Reproduce:
1. Install and setup oVirt Engine with DWH
2. Login to administration interface
3. See the error

Actual results:
Error is displayed after a clean setup

Expected results:
No errors after setup

Additional info:
See attached setup log for the sequence used by setup.

Comment 1 Sandro Bonazzola 2017-10-27 13:45:05 UTC
Created attachment 1344308 [details]
setup logs

Comment 2 Yedidyah Bar David 2017-12-05 13:21:19 UTC
(In reply to Sandro Bonazzola from comment #0)
> Description of problem:
> Just installed a fresh engine on a fresh CentOS  using ovirt-release-master.
> After engine-setup, connected to the admin interface, it shows 1 error:
> "ETL service sampling has encountered an error. Please consult the service
> log for more details."
> 
> I assume "service log" refers to
> "/var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log", the phrasing here can be
> improved, please consider opening a bug for it.

Agreed.

> 
> The log contains:
> 2017-10-27 15:16:58|ETL Service Started
> ovirtEngineDbDriverClass|org.postgresql.Driver
> ovirtEngineHistoryDbJdbcConnection|jdbc:postgresql://localhost:5432/
> ovirt_engine_history?sslfactory=org.postgresql.ssl.NonValidatingFactory
> hoursToKeepDaily|0
> hoursToKeepHourly|720
> ovirtEngineDbPassword|**********************
> runDeleteTime|3
> ovirtEngineDbJdbcConnection|jdbc:postgresql://localhost:5432/
> engine?sslfactory=org.postgresql.ssl.NonValidatingFactory
> runInterleave|60
> limitRows|limit 1000
> ovirtEngineHistoryDbUser|ovirt_engine_history
> ovirtEngineDbUser|engine
> deleteIncrement|10
> timeBetweenErrorEvents|300000
> hoursToKeepSamples|24
> deleteMultiplier|1000
> lastErrorSent|2011-07-03 12:46:47.000000
> etlVersion|4.2.1
> dwhAggregationDebug|false
> dwhUuid|0343f512-ed12-4882-8369-37fd3ab1ef97
> ovirtEngineHistoryDbDriverClass|org.postgresql.Driver
> ovirtEngineHistoryDbPassword|**********************
> 2017-10-27
> 15:17:
> 00|rtV5Cj|tHpOYa|dpjN55|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn
> |tWarn_1|Can not sample data, oVirt Engine is not updating the statistics.
> Please check your oVirt Engine status.|9704
> 
> The error is reported at 15:17:00 while the first line in engine.log is at
> 15:17:12:
> 2017-10-27 15:17:12,860+02 INFO 
> [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread
> Pool -- 48) [] Loaded file
> '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.
> 
> And server.log first log is at 15:17:01
> 2017-10-27 15:17:01,187+02 INFO  [org.jboss.remoting] (MSC service thread
> 1-1) JBoss Remoting version 5.0.5.Final
> 
> So, looks like DWH is started at least 12 seconds before the engine start
> producing data.
> 
> We should wait for the engine to be up before starting DWH.

I do not agree.

This would needlessly make engine-setup wait and take longer.

An alternative solution might be to make dwh emit such errors only after a certain amount of time since it was started - say, 1 minute or so.
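
In pseudo-python, such a startup grace period could look roughly like this (an illustrative sketch only, not the actual dwhd code; all names are made up):

import time

STARTUP_GRACE = 60  # seconds: suppress this particular alert right after startup
_service_start = time.monotonic()

def maybe_alert_no_engine_data(log):
    # Only emit the "engine is not updating the statistics" alert once the
    # service has been running for longer than the grace period.
    if time.monotonic() - _service_start >= STARTUP_GRACE:
        log("Can not sample data, oVirt Engine is not updating the statistics. "
            "Please check your oVirt Engine status.")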

Shirly, what do you think?

Comment 3 Shirly Radco 2018-01-11 08:47:08 UTC
I agree. Moving this to me. You can abandon the patch.

Comment 4 Shirly Radco 2018-02-15 10:02:03 UTC
After discussion with Yaniv Lavi, this needs to be handled during engine setup, since we do not want dwh to wait after every restart.

Moving this back to Ido.

Yaniv, please ack.

Comment 5 Yedidyah Bar David 2018-02-15 10:28:49 UTC
Don't we (potentially) have the same issue on system reboot, i.e. that dwh will start before the engine has started, or at least before the engine has data?

Even unrelated to timing, I do not think we want to emit false alarms at all. Doing so risks losing users' attention when a real problem happens.

If dwh checks every 60 seconds and just one check didn't see updates, I do not think we should alert users. I'd say we want something like this, in pseudo-python:

min_time_for_alerts = 200 # seconds
time_between_samples = 60 # seconds
last_good_sample_time = now()
while True:
    data = try_to_sample()
    if data:
        last_good_sample_time = now()
        process(data)
    else:
        if now() - last_good_sample_time > min_time_for_alerts:
            alert()
    sleep(time_between_samples)

That is, the alerts should not interfere with the sampling loop/logic. Also 200 does not need to be divisible by 60.

Comment 6 Yaniv Lavi 2018-02-26 13:20:17 UTC
(In reply to Yedidyah Bar David from comment #5)
> Don't we have (potentially) the same issue on system reboot? That dwh will
> start before the engine started, or at least before the engine has data?
> 
> Even unrelated to timing, I do not think we want to emit false alarms at
> all. Doing this risks loosing users' attention when a real problem happened.
> 
> If dwh checks every 60 seconds, and for one time didn't see updates, I do
> not think we should alert users. I'd say, we want something like this, in
> pseudo-python:
> 
> min_time_for_alerts = 200 # seconds
> time_between_samples = 60 # seconds
> last_good_sample_time = now()
> while true; do
>     try_to_sample
>     if found data:
>         last_good_sampe_time = now()
>         process data
>     else:
>         if now() - last_good_sample_time > min_time_for_alerts then:
>             alert
>     sleep(time_between_samples)
> 
> That is, the alerts should not interfere with the sampling loop/logic. Also
> 200 does not need to be divisible by 60.

We already have an anti-flood mechanism in DWH for when errors happen on every sample.
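
For reference, such throttling generally looks something like this (a sketch only, loosely based on the timeBetweenErrorEvents value of 300000 ms visible in the dwhd log above; not the actual DWH code):

import time

TIME_BETWEEN_ERROR_EVENTS_MS = 300_000  # matches the value in the config dump above
_last_error_sent_ms = float("-inf")

def report_error(log, message):
    # Forward an error event at most once per interval; repeats arriving
    # while the window is still open are dropped.
    global _last_error_sent_ms
    now_ms = time.monotonic() * 1000
    if now_ms - _last_error_sent_ms >= TIME_BETWEEN_ERROR_EVENTS_MS:
        _last_error_sent_ms = now_ms
        log(message)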

Comment 7 Yedidyah Bar David 2018-02-26 13:32:48 UTC
(In reply to Yaniv Lavi from comment #6)
> We have a anti-flood mechanism already in DWH if errors happen every sample.

Good, so should we also fine-tune it for the current case?

Comment 8 Yaniv Lavi 2018-02-26 13:40:45 UTC
(In reply to Yedidyah Bar David from comment #7)
> (In reply to Yaniv Lavi from comment #6)
> > We have a anti-flood mechanism already in DWH if errors happen every sample.
> 
> Good, so should we fine-tune it also for current case?

I'm not sure, since we don't want to delay every service start, only the start at boot, and we have no way to tell the difference.
We could lose debug info that we will need if we suppress errors on service start for 20 seconds, for example.

Comment 9 Yedidyah Bar David 2018-02-26 13:58:14 UTC
(In reply to Yaniv Lavi from comment #8)
> (In reply to Yedidyah Bar David from comment #7)
> > (In reply to Yaniv Lavi from comment #6)
> > > We have a anti-flood mechanism already in DWH if errors happen every sample.
> > 
> > Good, so should we fine-tune it also for current case?
> 
> I'm not sure since we don't want to block every service start, only on the
> boot and we have no way to know the difference.
> We could lose debug info we will need if we block error on service start for
> 20 seconds for example.

Not sure which debug info.

I am not suggesting blocking the logging of _errors_, only this specific alert "Can not sample data, oVirt Engine is not updating...". It is not accompanied by any other data, AFAICT. If it happens more than a few times, we'll log something eventually. If you want to know exactly when it started, you can do that with the alert in comment 5 - alert "did not get new data since $last_good_sample_time".
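
For example (illustrative only, not existing DWH code), an alert that carries the timestamp of the last successful sample:

from datetime import datetime

def alert_stale(last_good_sample_time: datetime, log) -> None:
    # Include when data was last seen, so users know exactly when it stopped flowing.
    log("Did not get new data since %s" % last_good_sample_time.isoformat(sep=" "))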

Comment 10 Lucie Leistnerova 2018-09-07 11:03:40 UTC
No ETL error after fresh install and reboot

verified in ovirt-engine-dwh-4.3.0-0.0.master.20180705143342.el7.noarch

Comment 11 Sandro Bonazzola 2018-11-02 14:32:52 UTC
This bugzilla is included in oVirt 4.2.7 release, published on November 2nd 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 12 Sandro Bonazzola 2018-11-02 14:50:31 UTC
Closed by mistake, moving back to qa -> verified

Comment 13 Sandro Bonazzola 2019-02-13 07:45:01 UTC
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

