Bug 1277591 - ETL service sampling has encountered an error. Please consult the service log for more details
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-dwh
Version: 3.4.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Assigned To: Shirly Radco
QA Contact: Pavel Stehlik
Depends On:
Blocks:
Reported: 2015-11-03 10:33 EST by Lynn Dixon
Modified: 2015-11-05 10:51 EST

Doc Type: Bug Fix
Last Closed: 2015-11-05 10:51:15 EST
Type: Bug
Attachments: None
Description Lynn Dixon 2015-11-03 10:33:43 EST
Description of problem:
The customer's RHEV-M web console begins to perform very slowly and eventually becomes unusable. Restarting the ovirt-engine-dwhd service makes the web console responsive again. However, they are seeing "ETL service sampling has encountered an error. Please consult the service log for more details" in their log files every 25 or so minutes. We also see "Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704" in the logs.
Verified that the database locale is set to en_US.UTF8.
Reporting in RHEV is not working.

Version-Release number of selected component (if applicable):
3.4.2-0.2.el6ev
rhevm-dwh-3.4.2-1.el6ev.noarch
rhevm-reports-3.4.2-1.el6ev.noarch


How reproducible:
Stopping and starting the dwhd service alleviates the problem temporarily.


Actual results:
Reporting in RHEV not working, and performance eventually degrades in the web console.

Expected results:  


Additional info:
Will collect logs when the customer is available, and then attach them to this bug.
Comment 1 Lynn Dixon 2015-11-03 12:06:21 EST
I have collected the logs using engine-log-collector. I will be happy to share them with anyone who has a @redhat.com email address. I do not want to attach them directly to this public-facing BZ, since they may contain customer data.

I will be happy to post sanitized bits of the collected logs as needed.
Comment 3 Yaniv Lavi (Dary) 2015-11-03 17:07:03 EST
There seem to be two different problems here:
1. RHEV webadmin is slow.
2. DWH log error on not being able to collect from engine.

I would guess that the first causes the second and not the other way around.
This error means that the engine heartbeat for DWH did not run for over a minute, so the stats were not collected. The engine service might also be down. Are we sure DWH is the one causing the slowness?
How much CPU/RAM is DWH taking up?
Comment 4 Lynn Dixon 2015-11-04 10:36:43 EST
Yaniv,
I am not sure which causes which. The RHEV webadmin will work great for hours or days at a time, but will eventually slow down and become unresponsive. After stopping the ovirt-engine-dwhd service, the webadmin console begins responding normally. The customer can leave the dwhd service stopped and the webadmin console will not slow down.

This machine has 16 GB of RAM and 4 processors. It is a KVM virtual guest on a RHEL6 host that the customer is using specifically to run RHEV-M.
Comment 5 Lynn Dixon 2015-11-05 10:51:15 EST
Many, many thanks to Yaniv Dary for helping find a solution to this issue. Three entries in the dwh_history_timekeeping table of the engine database had dates very far in the future (see below):

lastSampling    \N    2059-02-24 14:13:08.974-06
lastSync    \N    2059-02-24 14:12:08-06
lastFullHostCheck    \N    2059-02-24 14:12:08-06


Per Yaniv's suggestion, I moved the dates of those three entries back to a time in the past (Jan 1st, 2000) so that DWHD would no longer error every 30 seconds. Here are the PostgreSQL statements I used:

UPDATE dwh_history_timekeeping SET var_datetime = '2000-01-01' WHERE var_name = 'lastSampling';
UPDATE dwh_history_timekeeping SET var_datetime = '2000-01-01' WHERE var_name = 'lastFullHostCheck';
UPDATE dwh_history_timekeeping SET var_datetime = '2000-01-01' WHERE var_name = 'lastSync';
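
For anyone hitting something similar, here is a rough sketch of how the timekeeping values could be checked for future dates before changing anything. It is not part of the actual fix. It assumes the rows have been exported as (var_name, var_datetime) text pairs in the format shown above; the function names parse_pg_timestamp and future_entries are made up for this example.

```python
import re
from datetime import datetime, timezone


def parse_pg_timestamp(text):
    """Parse a timestamp printed like '2059-02-24 14:13:08.974-06' (hypothetical helper)."""
    # Expand a bare hour offset such as '-06' to '-06:00' so that
    # datetime.fromisoformat() accepts it on older Python 3 releases too.
    text = re.sub(r"(\d{2}:\d{2}(?:\.\d+)?)([+-]\d{2})$", r"\1\2:00", text.strip())
    return datetime.fromisoformat(text)


def future_entries(rows, now):
    """Return the var_name of every row whose timestamp is later than `now`."""
    return [name for name, stamp in rows if parse_pg_timestamp(stamp) > now]


# The three values quoted in this comment.
ROWS = [
    ("lastSampling", "2059-02-24 14:13:08.974-06"),
    ("lastSync", "2059-02-24 14:12:08-06"),
    ("lastFullHostCheck", "2059-02-24 14:12:08-06"),
]

if __name__ == "__main__":
    # `now` must be timezone-aware so it can be compared with the parsed values.
    for name in future_entries(ROWS, datetime.now(timezone.utc)):
        print(name, "is dated in the future")
```

Any name this reports is a candidate for the kind of reset shown in the UPDATE statements above.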

I then restarted ovirt-engine-dwhd and let the data warehouse collect overnight.  Reporting began working correctly. 

Thank you very much to Yaniv for the help!
