Red Hat Bugzilla – Bug 1277591
ETL service sampling has encountered an error. Please consult the service log for more details
Last modified: 2015-11-05 10:51:15 EST
Description of problem:
The customer's RHEV-M web console begins to perform very slowly and eventually becomes unusable. Restarting the ovirt-engine-dwhd service makes the web console responsive again. However, they are seeing "ETL service sampling has encountered an error. Please consult the service log for more details" in their log files every 25 or so minutes. We also see "Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704" in the logs.
Verified the database locale is set to en_US.UTF8
Reporting in RHEV is not working.
Version-Release number of selected component (if applicable):
Starting and stopping the dwhd service will alleviate the problem temporarily.
Reporting in RHEV not working, and performance eventually degrades in the web console.
I will collect logs when the customer is available, and then attach them to this BZ.
I have collected the logs using engine-log-collector. I will be happy to share the logs with anyone who has a @redhat.com email address. I do not want to attach them directly to this public-facing BZ, since they may contain customer data.
I will be happy to post sanitized bits of the collected logs as needed.
There seem to be two different problems here:
1. RHEV webadmin is slow.
2. DWH log error on not being able to collect from engine.
I would guess that the first causes the second and not the other way around.
This error means that the engine heartbeat for DWH did not run for over a minute, so the stats were not collected. The engine service might also be down. Are we sure the DWH is the one causing the slowness?
How much CPU/RAM is the DWH taking up?
I am not sure which causes which. The RHEV webadmin will work great for hours or days at a time, but will eventually slow down and become unresponsive. Stopping the ovirt-engine-dwhd service makes the webadmin console respond normally again. The customer can leave the dwhd service stopped and the webadmin console will not slow down.
This machine has 16 GB of RAM and 4 processors. It is a virtual KVM guest on a RHEL6 host that the customer is using specifically to run RHEV-M.
Many many thanks to Yaniv Dary for helping find a solution to this issue. There were three entries in the dwh_history_timekeeping table of the engine database whose dates were set very far in the future (see below):
lastSampling \N 2059-02-24 14:13:08.974-06
lastSync \N 2059-02-24 14:12:08-06
lastFullHostCheck \N 2059-02-24 14:12:08-06
Per Yaniv's suggestion, I moved the dates of those three entries back to a time in the past (Jan 1st, 2000) so that DWHD would no longer report an error every 30 seconds. Here are the PostgreSQL statements I used:
UPDATE dwh_history_timekeeping SET var_datetime = '2000-01-01' WHERE var_name = 'lastSampling';
UPDATE dwh_history_timekeeping SET var_datetime = '2000-01-01' WHERE var_name = 'lastFullHostCheck';
UPDATE dwh_history_timekeeping SET var_datetime = '2000-01-01' WHERE var_name = 'lastSync';
I then restarted ovirt-engine-dwhd and let the data warehouse collect overnight. Reporting began working correctly.
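For anyone hitting the same symptom, a query along these lines could be used to check whether any timekeeping entries are dated in the future before changing anything. This is only a sketch: it assumes the same dwh_history_timekeeping table and the var_name / var_datetime columns used in the UPDATE statements above, and that it is run against the engine database.

```sql
-- Sketch only: relies on the table and column names that appear in the
-- UPDATE statements in this report; nothing else about the schema is assumed.
-- Lists timekeeping entries whose timestamp lies in the future.
SELECT var_name, var_datetime
FROM dwh_history_timekeeping
WHERE var_datetime > now()
ORDER BY var_name;
```

If this returns no rows, the timekeeping dates are probably not the cause and the future-date fix described here would not apply.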
Thank you very much to Yaniv for the help!