Description of problem:
Just installed a fresh engine on a fresh CentOS using ovirt-release-master.
After engine-setup, connected to the admin interface, it shows 1 error:
"ETL service sampling has encountered an error. Please consult the service log for more details."

I assume "service log" refers to "/var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log"; the phrasing here can be improved, please consider opening a bug for it.

The log contains:
2017-10-27 15:16:58|ETL Service Started
ovirtEngineDbDriverClass|org.postgresql.Driver
ovirtEngineHistoryDbJdbcConnection|jdbc:postgresql://localhost:5432/ovirt_engine_history?sslfactory=org.postgresql.ssl.NonValidatingFactory
hoursToKeepDaily|0
hoursToKeepHourly|720
ovirtEngineDbPassword|**********************
runDeleteTime|3
ovirtEngineDbJdbcConnection|jdbc:postgresql://localhost:5432/engine?sslfactory=org.postgresql.ssl.NonValidatingFactory
runInterleave|60
limitRows|limit 1000
ovirtEngineHistoryDbUser|ovirt_engine_history
ovirtEngineDbUser|engine
deleteIncrement|10
timeBetweenErrorEvents|300000
hoursToKeepSamples|24
deleteMultiplier|1000
lastErrorSent|2011-07-03 12:46:47.000000
etlVersion|4.2.1
dwhAggregationDebug|false
dwhUuid|0343f512-ed12-4882-8369-37fd3ab1ef97
ovirtEngineHistoryDbDriverClass|org.postgresql.Driver
ovirtEngineHistoryDbPassword|**********************
2017-10-27 15:17:00|rtV5Cj|tHpOYa|dpjN55|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704

The error is reported at 15:17:00, while the first line in engine.log is at 15:17:12:
2017-10-27 15:17:12,860+02 INFO  [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 48) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

And the first line in server.log is at 15:17:01:
2017-10-27 15:17:01,187+02 INFO  [org.jboss.remoting] (MSC service thread 1-1) JBoss Remoting version 5.0.5.Final

So it looks like DWH is started at least 12 seconds before the engine starts producing data.

We should wait for the engine to be up before starting DWH (a sketch of what that could look like follows at the end of this comment).
Version-Release number of selected component (if applicable):
# rpm -qa | grep ovirt-engine
ovirt-engine-setup-plugin-websocket-proxy-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-dwh-4.2.1-0.2.master.20171026133424.el7.centos.noarch
ovirt-engine-setup-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-restapi-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-backend-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-sdk-python-3.6.9.2-0.1.20161204.gite99bbd1.el7.centos.noarch
ovirt-engine-lib-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-setup-base-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-wildfly-11.0.0-1.el7.centos.x86_64
ovirt-engine-cli-3.6.9.3-0.1.20170116.gitad0561a.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-tools-backup-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-websocket-proxy-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-dwh-setup-4.2.1-0.2.master.20171026133424.el7.centos.noarch
ovirt-engine-webadmin-portal-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-vmconsole-proxy-helper-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-setup-plugin-ovirt-engine-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-tools-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-api-explorer-0.0.1-0.0.20171013git1fbdc17.el7.centos.noarch
ovirt-engine-dbscripts-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-extensions-api-impl-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-wildfly-overlay-11.0.1-1.el7.centos.noarch
ovirt-engine-dashboard-1.2.0-0.5.20171012git78f7a28.el7.centos.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-metrics-1.1.1-0.0.master.20171019113406.el7.centos.noarch
ovirt-engine-4.2.0-0.0.master.20171026172343.git582bf7c.el7.centos.noarch
ovirt-engine-extension-aaa-jdbc-1.1.7-0.0.master.git1660a27.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.1-1.a2.20171023gita89e6a9.el7.centos.x86_64

How reproducible:
I tried only once.

Steps to Reproduce:
1. Install and set up oVirt Engine with DWH
2. Log in to the administration interface
3. See the error

Actual results:
Error is displayed after a clean setup.

Expected results:
No errors after setup.

Additional info:
See attached setup log for the sequence used by setup.
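For illustration only, a minimal sketch of "wait for the engine before starting DWH", written as Python 2 as shipped on el7. The health-servlet URL and the timeout values are my assumptions for the sketch, not the actual engine-setup code:

import time
import urllib2

HEALTH_URL = 'http://localhost/ovirt-engine/services/health'  # assumed endpoint
TIMEOUT = 120   # seconds; give up eventually so setup cannot hang forever
INTERVAL = 5    # seconds between polls


def wait_for_engine():
    # Poll the engine until it answers, then it is safe to start dwhd.
    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        try:
            if urllib2.urlopen(HEALTH_URL, timeout=10).getcode() == 200:
                return True
        except Exception:
            pass  # engine not listening yet; retry
        time.sleep(INTERVAL)
    return False


if __name__ == '__main__':
    if not wait_for_engine():
        raise SystemExit('engine did not come up within %d seconds' % TIMEOUT)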
Created attachment 1344308 [details] setup logs
(In reply to Sandro Bonazzola from comment #0)
> Description of problem:
> Just installed a fresh engine on a fresh CentOS using ovirt-release-master.
> After engine-setup, connected to the admin interface, it shows 1 error:
> "ETL service sampling has encountered an error. Please consult the service
> log for more details."
>
> I assume "service log" refers to
> "/var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log"; the phrasing here can be
> improved, please consider opening a bug for it.

Agreed.

> [...config dump and log excerpts snipped...]
>
> So it looks like DWH is started at least 12 seconds before the engine
> starts producing data.
>
> We should wait for the engine to be up before starting DWH.

I do not agree. This would needlessly make engine-setup wait and take longer.

An alternative solution might be to make dwh emit such errors only after a certain amount of time since it was started - say, 1 minute or so. A sketch of this idea follows below.

Shirly, what do you think?
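Roughly, the grace-period alternative could look like this. STARTUP_GRACE and the function names are illustrative assumptions, not the actual dwhd ETL code:

import time

STARTUP_GRACE = 60  # seconds; illustrative value for "1 minute or so"
_service_start = time.time()


def send_alert(message):
    # Stand-in for DWH's real alerting path (audit log / notifier).
    print(message)


def alert_engine_not_updating():
    # Suppress this particular alert while the service is still inside its
    # startup window, so a normal engine/dwhd startup race does not produce
    # a false alarm.
    if time.time() - _service_start < STARTUP_GRACE:
        return
    send_alert('Can not sample data, oVirt Engine is not updating '
               'the statistics. Please check your oVirt Engine status.')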
I agree. Moving this to me. You can abandon the patch.
After discussion with Yaniv Lavi, this needs to be handled during the engine setup since we do not want dwh to wait after every restart. Moving this back to Ido. Yaniv, please ack.
Don't we have (potentially) the same issue on system reboot? That dwh will start before the engine started, or at least before the engine has data?

Even unrelated to timing, I do not think we want to emit false alarms at all. Doing this risks losing users' attention when a real problem happens.

If dwh checks every 60 seconds and one time didn't see updates, I do not think we should alert users. I'd say we want something like this, in pseudo-python:

min_time_for_alerts = 200  # seconds
time_between_samples = 60  # seconds
last_good_sample_time = now()
while True:
    try_to_sample
    if found data:
        last_good_sample_time = now()
        process data
    else:
        if now() - last_good_sample_time > min_time_for_alerts:
            alert
    sleep(time_between_samples)

That is, the alerts should not interfere with the sampling loop/logic. Also, 200 does not need to be divisible by 60. A runnable rendering of this sketch follows below.
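A runnable rendering of the pseudo-python above, for illustration; the sample/process/alert stubs are assumptions standing in for the real ETL calls:

import time

MIN_TIME_FOR_ALERTS = 200  # seconds without new data before alerting
TIME_BETWEEN_SAMPLES = 60  # seconds between sampling attempts


def try_to_sample():
    # Stub: the real code would query the engine DB for fresh statistics.
    return None


def process(data):
    print('processing %s' % (data,))


def alert(since):
    # The alert text can carry the timestamp of the last good sample.
    print('did not get new data since %s' % time.ctime(since))


def sampling_loop():
    last_good_sample_time = time.time()
    while True:
        data = try_to_sample()
        if data is not None:
            last_good_sample_time = time.time()
            process(data)
        elif time.time() - last_good_sample_time > MIN_TIME_FOR_ALERTS:
            # Alerting is a side channel; it never breaks the sampling loop.
            alert(last_good_sample_time)
        time.sleep(TIME_BETWEEN_SAMPLES)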
(In reply to Yedidyah Bar David from comment #5)
> Don't we have (potentially) the same issue on system reboot? That dwh will
> start before the engine started, or at least before the engine has data?
>
> Even unrelated to timing, I do not think we want to emit false alarms at
> all. Doing this risks losing users' attention when a real problem happens.
> [...]
> That is, the alerts should not interfere with the sampling loop/logic.
> Also, 200 does not need to be divisible by 60.

We already have an anti-flood mechanism in DWH if errors happen on every sample.
(In reply to Yaniv Lavi from comment #6)
> We already have an anti-flood mechanism in DWH if errors happen on every sample.

Good, so should we fine-tune it for the current case as well?
(In reply to Yedidyah Bar David from comment #7)
> (In reply to Yaniv Lavi from comment #6)
> > We already have an anti-flood mechanism in DWH if errors happen on every sample.
>
> Good, so should we fine-tune it for the current case as well?

I'm not sure, since we want to block this only at boot, not on every service start, and we have no way to tell the difference. If we suppress errors for, say, 20 seconds after every service start, we could lose debug info we will need.
(In reply to Yaniv Lavi from comment #8)
> (In reply to Yedidyah Bar David from comment #7)
> > (In reply to Yaniv Lavi from comment #6)
> > > We already have an anti-flood mechanism in DWH if errors happen on every sample.
> >
> > Good, so should we fine-tune it for the current case as well?
>
> I'm not sure, since we want to block this only at boot, not on every
> service start, and we have no way to tell the difference. If we suppress
> errors for, say, 20 seconds after every service start, we could lose debug
> info we will need.

Not sure which debug info. I am not suggesting to block logging _errors_, only this specific alert "Can not sample data, oVirt Engine is not updating...". It's not accompanied by any other data, AFAICT. If it happens more than a few times, we'll log something eventually. If you want to know exactly when it started, you can do that in the alert in comment 5 - alert "did not get new data since $last_good_sample_time". A sketch of such a gate follows below.
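For illustration, combining the existing anti-flood idea with the alert text suggested above could look roughly like this. Only timeBetweenErrorEvents (300000 ms) and lastErrorSent come from the config dump in comment 0; the function names are assumptions, not the actual DWH code:

import time

TIME_BETWEEN_ERROR_EVENTS = 300  # seconds; timeBetweenErrorEvents|300000 (ms) in the config dump

last_error_sent = 0.0  # epoch seconds; the real dwhd persists this as lastErrorSent


def emit_alert(message):
    # Stand-in for writing the audit-log event users see in the webadmin.
    print(message)


def rate_limited_alert(last_good_sample_time):
    # Emit at most one copy of the repeating "not updating" alert per
    # anti-flood window, and include the timestamp so users know when
    # data actually stopped flowing.
    global last_error_sent
    now = time.time()
    if now - last_error_sent < TIME_BETWEEN_ERROR_EVENTS:
        return  # inside the flood window; drop the duplicate alert
    last_error_sent = now
    emit_alert('did not get new data since %s'
               % time.ctime(last_good_sample_time))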
Verified in ovirt-engine-dwh-4.3.0-0.0.master.20180705143342.el7.noarch: no ETL error after a fresh install and reboot.
This bug is included in the oVirt 4.2.7 release, published on November 2nd 2018. Since the problem described in this bug report should be resolved in the oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
Closed by mistake; moving back to QA -> VERIFIED.
This bug is included in the oVirt 4.3.0 release, published on February 4th 2019. Since the problem described in this bug report should be resolved in the oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.