Bug 1277591

Summary: ETL service sampling has encountered an error. Please consult the service log for more details
Product: Red Hat Enterprise Virtualization Manager Reporter: Lynn Dixon <ldixon>
Component: ovirt-engine-dwh Assignee: Shirly Radco <sradco>
Status: CLOSED NOTABUG QA Contact: Pavel Stehlik <pstehlik>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.4.2 CC: ecohen, gklein, ldixon, lsurette, rbalakri, Rhev-m-bugs, yeylon, ylavi
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-05 15:51:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Lynn Dixon 2015-11-03 15:33:43 UTC
Description of problem:
The customer's RHEV-M web console begins to perform very slowly and eventually becomes unusable. Restarting the ovirt-engine-dwhd service makes the web console responsive again. However, they are seeing "ETL service sampling has encountered an error. Please consult the service log for more details" in their log files every 25 or so minutes. We also see "Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704" in the logs.
Verified the database locale is set to en_US.UTF8
Reporting in RHEV is not working.

Version-Release number of selected component (if applicable):
3.4.2-0.2.el6ev
rhevm-dwh-3.4.2-1.el6ev.noarch
rhevm-reports-3.4.2-1.el6ev.noarch


How reproducible:
Stopping and starting the dwhd service alleviates the problem temporarily.


Actual results:
Reporting in RHEV not working, and performance eventually degrades in the web console.

Expected results:  


Additional info:
Will collect logs when the customer is available, and then attach them to this BZ.

Comment 1 Lynn Dixon 2015-11-03 17:06:21 UTC
I have collected the logs using engine-log-collector. I will be happy to share them with anyone who has a @redhat.com email address. I do not want to attach them directly to this public-facing BZ, since they may contain customer data.

I will be happy to post sanitized bits of the collected logs as needed.

Comment 3 Yaniv Lavi 2015-11-03 22:07:03 UTC
There seem to be two different problems here:
1. RHEV webadmin is slow.
2. DWH log error on not being able to collect from engine.

I would guess that the first causes the second and not the other way around.
This error means that the engine heartbeat for DWH did not run for over a minute, so the stats were not collected. The engine service might also be down. Are we sure the DWH is the one causing the slowness?
How much CPU/RAM is the DWH taking up?

Comment 4 Lynn Dixon 2015-11-04 15:36:43 UTC
Yaniv,
I am not sure which causes which. The RHEV webadmin will work great for hours or days at a time, but will eventually slow down and become unresponsive. After stopping the ovirt-engine-dwhd service, the webadmin console begins responding normally. The customer can leave the dwhd service stopped and the webadmin console will not slow down.

This machine has 16 gig ram, and 4 procs. It is a virtual KVM guest on a RHEL6 host the customer is using specifically to run RHEV-M.

Comment 5 Lynn Dixon 2015-11-05 15:51:15 UTC
Many many thanks to Yaniv Dary for helping find a solution to this issue. There were three entries in the dwh_history_timekeeping table of the engine database whose dates were very far in the future (see below):

lastSampling    \N    2059-02-24 14:13:08.974-06
lastSync    \N    2059-02-24 14:12:08-06
lastFullHostCheck    \N    2059-02-24 14:12:08-06


Per Yaniv's suggestion I moved the dates of those three entries back to some time in the past (Jan 1st, 2000) so that DWHD would no longer error every 30 seconds. Here are the PostgreSQL statements I used:

UPDATE dwh_history_timekeeping SET var_datetime = '2000-01-01' WHERE var_name = 'lastSampling';
UPDATE dwh_history_timekeeping SET var_datetime = '2000-01-01' WHERE var_name = 'lastFullHostCheck';
UPDATE dwh_history_timekeeping SET var_datetime = '2000-01-01' WHERE var_name = 'lastSync';
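
A read-only check along the following lines could be run before and after the updates to see which timekeeping entries are dated in the future. This is only a sketch and was not part of the original fix. It assumes the same dwh_history_timekeeping table and the var_name / var_datetime columns used in the statements above, and that the database server's clock is correct, since the comparison is against now().

```sql
-- Sketch only: list timekeeping entries whose timestamp lies in the future.
-- Assumes the table and column names from the UPDATE statements above.
-- Read-only; it does not modify any data.
SELECT var_name, var_datetime
FROM dwh_history_timekeeping
WHERE var_datetime > now()
ORDER BY var_name;
```

If the query returns no rows after the updates, none of the entries are dated ahead of the server clock any more.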

I then restarted ovirt-engine-dwhd and let the data warehouse collect overnight.  Reporting began working correctly. 

Thank you very much to Yaniv for the help!