Bug 1507037
| Summary: | Race condition on starting DWH on fresh install. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-engine-dwh | Reporter: | Sandro Bonazzola <sbonazzo> | ||||
| Component: | General | Assignee: | Ido Rosenzwig <irosenzw> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Lucie Leistnerova <lleistne> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 4.2.0 | CC: | bugs, didi, lsvaty, sradco | ||||
| Target Milestone: | ovirt-4.3.0 | Flags: | rule-engine:
ovirt-4.3+
|
||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | ovirt-engine-dwh-4.3.0 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2019-02-13 07:45:01 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1631202 | ||||||
| Attachments: |
|
||||||
|
Description
Sandro Bonazzola
2017-10-27 13:44:07 UTC
Created attachment 1344308 [details]
setup logs
(In reply to Sandro Bonazzola from comment #0) > Description of problem: > Just installed a fresh engine on a fresh CentOS using ovirt-release-master. > After engine-setup, connected to the admin interface, it shows 1 error: > "ETL service sampling has encountered an error. Please consult the service > log for more details." > > I assume "service log" refers to > "/var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log", the phrasing here can be > improved, please consider opening a bug for it. Agreed. > > The log contains: > 2017-10-27 15:16:58|ETL Service Started > ovirtEngineDbDriverClass|org.postgresql.Driver > ovirtEngineHistoryDbJdbcConnection|jdbc:postgresql://localhost:5432/ > ovirt_engine_history?sslfactory=org.postgresql.ssl.NonValidatingFactory > hoursToKeepDaily|0 > hoursToKeepHourly|720 > ovirtEngineDbPassword|********************** > runDeleteTime|3 > ovirtEngineDbJdbcConnection|jdbc:postgresql://localhost:5432/ > engine?sslfactory=org.postgresql.ssl.NonValidatingFactory > runInterleave|60 > limitRows|limit 1000 > ovirtEngineHistoryDbUser|ovirt_engine_history > ovirtEngineDbUser|engine > deleteIncrement|10 > timeBetweenErrorEvents|300000 > hoursToKeepSamples|24 > deleteMultiplier|1000 > lastErrorSent|2011-07-03 12:46:47.000000 > etlVersion|4.2.1 > dwhAggregationDebug|false > dwhUuid|0343f512-ed12-4882-8369-37fd3ab1ef97 > ovirtEngineHistoryDbDriverClass|org.postgresql.Driver > ovirtEngineHistoryDbPassword|********************** > 2017-10-27 > 15:17: > 00|rtV5Cj|tHpOYa|dpjN55|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn > |tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. > Please check your oVirt Engine status.|9704 > > The error is reported at 15:17:00 while the first line in engine.log is at > 15:17:12: > 2017-10-27 15:17:12,860+02 INFO > [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread > Pool -- 48) [] Loaded file > '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'. > > And server.log first log is at 15:17:01 > 2017-10-27 15:17:01,187+02 INFO [org.jboss.remoting] (MSC service thread > 1-1) JBoss Remoting version 5.0.5.Final > > So, looks like DWH is started at least 12 seconds before the engine start > producing data. > > We should wait for the engine to be up before starting DWH. I do not agree. This requires needlessly to make engine-setup wait and take longer. An alternative solution might be to make dwh emit such errors only after a certain amount of time since it was started - say, 1 minute or so. Shirly, what do you think? I agree. Moving this to me. You can abandon the patch. After discussion with Yaniv Lavi, this needs to be handled during the engine setup since we do not want dwh to wait after every restart. Moving this back to Ido. Yaniv, please ack. Don't we have (potentially) the same issue on system reboot? That dwh will start before the engine started, or at least before the engine has data?
Even unrelated to timing, I do not think we want to emit false alarms at all. Doing this risks loosing users' attention when a real problem happened.
If dwh checks every 60 seconds, and for one time didn't see updates, I do not think we should alert users. I'd say, we want something like this, in pseudo-python:
min_time_for_alerts = 200 # seconds
time_between_samples = 60 # seconds
last_good_sample_time = now()
while true; do
try_to_sample
if found data:
last_good_sampe_time = now()
process data
else:
if now() - last_good_sample_time > min_time_for_alerts then:
alert
sleep(time_between_samples)
That is, the alerts should not interfere with the sampling loop/logic. Also 200 does not need to be divisible by 60.
(In reply to Yedidyah Bar David from comment #5) > Don't we have (potentially) the same issue on system reboot? That dwh will > start before the engine started, or at least before the engine has data? > > Even unrelated to timing, I do not think we want to emit false alarms at > all. Doing this risks loosing users' attention when a real problem happened. > > If dwh checks every 60 seconds, and for one time didn't see updates, I do > not think we should alert users. I'd say, we want something like this, in > pseudo-python: > > min_time_for_alerts = 200 # seconds > time_between_samples = 60 # seconds > last_good_sample_time = now() > while true; do > try_to_sample > if found data: > last_good_sampe_time = now() > process data > else: > if now() - last_good_sample_time > min_time_for_alerts then: > alert > sleep(time_between_samples) > > That is, the alerts should not interfere with the sampling loop/logic. Also > 200 does not need to be divisible by 60. We have a anti-flood mechanism already in DWH if errors happen every sample. (In reply to Yaniv Lavi from comment #6) > We have a anti-flood mechanism already in DWH if errors happen every sample. Good, so should we fine-tune it also for current case? (In reply to Yedidyah Bar David from comment #7) > (In reply to Yaniv Lavi from comment #6) > > We have a anti-flood mechanism already in DWH if errors happen every sample. > > Good, so should we fine-tune it also for current case? I'm not sure since we don't want to block every service start, only on the boot and we have no way to know the difference. We could lose debug info we will need if we block error on service start for 20 seconds for example. (In reply to Yaniv Lavi from comment #8) > (In reply to Yedidyah Bar David from comment #7) > > (In reply to Yaniv Lavi from comment #6) > > > We have a anti-flood mechanism already in DWH if errors happen every sample. > > > > Good, so should we fine-tune it also for current case? > > I'm not sure since we don't want to block every service start, only on the > boot and we have no way to know the difference. > We could lose debug info we will need if we block error on service start for > 20 seconds for example. Not sure which debug info. I am not suggesting to block logging _errors_, only this specific alert "Can not sample data, oVirt Engine is not updating...". It's not accompanied by any other data, AFAICT. If it happens more than a few times, we'll log something eventually. If you want to know exactly when it started, you can do that in the alert in comment 5 - alert "did not get new data since $last_good_sample_time". No ETL error after fresh install and reboot verified in ovirt-engine-dwh-4.3.0-0.0.master.20180705143342.el7.noarch This bugzilla is included in oVirt 4.2.7 release, published on November 2nd 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report. Closed by mistake, moving back to qa -> verified This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report. |