Bug 962177 - [rhevm-dwh] - ETL Reports error when a Single Host in setup is Non-Responsive ("ETL service sampling has encountered an error")
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.3.0
Assigned To: Martin Perina
QA Contact: Barak Dagan
Whiteboard: infra
Keywords:
Duplicates: 1001762
Depends On:
Blocks: 1008370
Reported: 2013-05-12 06:33 EDT by David Botzer
Modified: 2016-02-10 14:10 EST
CC List: 19 users

See Also:
Fixed In Version: is18
Doc Type: Bug Fix
Doc Text:
Previously, the data warehouse would assume that the engine was not running when all hosts registered in the engine were non-responsive because the engine would not update their data. This update introduces a data warehouse heartbeat job that lets the data warehouse know that the engine is active even when all hosts are in a non-responsive state. The heartbeat job functions by periodically updating the status in the database to notify the data warehouse that the engine is active. The interval for updating the heartbeat can be configured via engine-config using the DwhHeartBeatInterval variable.
Story Points: ---
Clone Of:
Clones: 1008370
Environment:
Last Closed: 2014-01-21 12:21:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
dwh-log (1.85 MB, application/octet-stream)
2013-05-12 06:33 EDT, David Botzer
all-logs (616.97 KB, application/x-gzip)
2013-05-12 06:34 EDT, David Botzer


External Trackers
Tracker                            ID      Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  413393  None      None    None     Never
oVirt gerrit                       19366   None      None    None     Never
oVirt gerrit                       19367   None      None    None     Never
oVirt gerrit                       19827   None      None    None     Never
oVirt gerrit                       19828   None      None    None     Never

Description David Botzer 2013-05-12 06:33:41 EDT
Created attachment 746831
dwh-log

Description of problem:
ETL Reports error when a Single Host in setup is Non-Responsive ("ETL service sampling has encountered an error")

Version-Release number of selected component (if applicable):
3.2/sf15

How reproducible:
always

Steps to Reproduce:
1. Install rhevm + dwh + reports.
2. Create a couple of VMs.
3. Shut down the host physically, or unplug the power cord (so that the host status becomes "Non-Responsive").
  
Actual results:
The VMs are still running, and the ETL reports an error ("ETL service sampling has encountered an error").

Expected results:
The ETL should retrieve data from all entities whose status is "Up".

Additional info:
Logs
Comment 1 David Botzer 2013-05-12 06:34:04 EDT
Yaniv:
This may happen because rhevm doesn't update the host statistics table when a single host is non-responsive.
Comment 2 David Botzer 2013-05-12 06:34:23 EDT
Created attachment 746832
all-logs
Comment 3 Yaniv Lavi 2013-05-12 07:48:24 EDT
Eli, do we update the host stats table in this situation?



Yaniv
Comment 4 Eli Mesika 2013-05-13 21:39:08 EDT
(In reply to comment #3)
> Eli, do we update the host stats table in this situation?
> 
> 
> 
> Yaniv

I see nothing in the code that blocks the Host stat updates when its status is not responding ...
Comment 5 Yaniv Lavi 2013-05-16 06:53:09 EDT
(In reply to comment #4)
> (In reply to comment #3)
> > Eli, do we update the host stats table in this situation?
> > 
> > 
> > 
> > Yaniv
> 
> I see nothing in the code that blocks the Host stat updates when its status
> is not responding ...

Could it be that VDSM doesn't send stats for non-responsive hosts, or something along those lines?
Comment 6 Barak 2013-06-06 08:28:35 EDT
(In reply to Yaniv Dary from comment #5)
> Could it be that VDSM doesn't send stats for non-responsive hosts, or
> something along those lines?

1 - VDSM is being polled (it doesn't send anything).
2 - Non-responsive means there is no communication with the host.

This has never stopped the engine from updating the stats of the rest of the hosts,
and the DWH should not stop collecting data in those areas.

The scenario is easy to reproduce.
Yaniv, can you please try reproducing it in your environment?
Comment 7 Barak 2013-06-09 07:45:15 EDT
After discussing the issue with Yaniv:
This issue happens only when all hosts in the system are non-responsive.

Although this case is less interesting, it still needs to be handled on the engine side, regardless of the hosts' statuses.

One option is to update a variable in the dwh_history_timekeeping table every time the engine starts polling VDSM.

We tried that in the past, but we used the same transactions as the stats collection, which caused DB locking issues.

The fact is that all the DWH needs is to understand that the engine is up.
So the most basic approach is simply to have a Quartz job that functions as a heartbeat and updates the variable mentioned above. This enables the DWH to understand that the engine is actually up and to produce a record for all hosts (even when they are non-responsive).

The heartbeat frequency should be configurable in vdc_options.
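
To make the heartbeat proposal concrete, here is a minimal Python sketch. It is not the engine's implementation (the engine is Java, and the proposal is a Quartz job); it only illustrates the idea of writing the heartbeat in its own short transaction and letting the DWH side check its age. It uses an in-memory SQLite database. Only the table name dwh_history_timekeeping comes from this discussion; the column names (var_name, var_datetime), the variable name heartBeat, and all function names are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical names: only the table name appears in this bug. The columns
# and the variable name below are assumptions for this sketch.
HEARTBEAT_VAR = "heartBeat"


def init_db(conn):
    """Create a stand-in for the timekeeping table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dwh_history_timekeeping ("
        "var_name TEXT PRIMARY KEY, var_datetime TEXT)"
    )
    conn.commit()


def write_heartbeat(conn, now):
    """Engine side: record that the engine is alive.

    The write runs in its own short transaction, separate from any host
    statistics collection, so it does not share locks with that work.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO dwh_history_timekeeping "
            "(var_name, var_datetime) VALUES (?, ?)",
            (HEARTBEAT_VAR, now.isoformat()),
        )


def engine_is_up(conn, now, max_age):
    """DWH side: treat the engine as up if the heartbeat is recent enough."""
    row = conn.execute(
        "SELECT var_datetime FROM dwh_history_timekeeping WHERE var_name = ?",
        (HEARTBEAT_VAR,),
    ).fetchone()
    if row is None:
        return False  # no heartbeat has ever been written
    last_beat = datetime.fromisoformat(row[0])
    return now - last_beat <= max_age
```

With this shape, the DWH no longer needs host statistics rows to change in order to conclude that the engine is running: a heartbeat newer than the allowed age is enough, even when every host is non-responsive.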
Comment 8 Barak 2013-06-09 07:46:51 EDT
Alon, Yair,

Please see comment #7
Comment 9 Alon Bar-Lev 2013-06-09 07:55:23 EDT
(In reply to Barak from comment #7)
> The fact is that all the DWH needs is to understand that the engine is up.
> So the most basic approach is simply to have a Quartz job that functions as
> a heartbeat and updates the variable mentioned above. This enables the DWH
> to understand that the engine is actually up and to produce a record for
> all hosts (even when they are non-responsive).

We discussed this when we discussed bug #918039, comment #2...

I don't like blind watchdogs... but it is better than what we have anyway...
Comment 10 Yair Zaslavsky 2013-06-09 10:29:01 EDT
Barak, from a pure Java/Java EE point of view, this seems doable.
As I understand it, we need to modify one value in a table that resides in the engine database.
There is no problem with adding a scheduler for this.
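
A scheduler for such a job can be very small. The following Python sketch is only an illustration of the periodic-job idea; the engine itself would use a Java scheduler such as Quartz. The function name run_heartbeat_loop and its parameters are hypothetical, and interval_seconds stands in for whatever configurable interval the engine exposes.

```python
import threading


def run_heartbeat_loop(beat, interval_seconds, stop_event, max_attempts=None):
    """Call beat() every interval_seconds until stop_event is set.

    max_attempts is a demo/test convenience and not part of the proposal.
    Returns the number of heartbeat attempts made.
    """
    attempts = 0
    while not stop_event.is_set():
        try:
            beat()
        except Exception as exc:
            # A failed write should not kill the heartbeat; try again next tick.
            print(f"heartbeat failed: {exc}")
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            break
        # wait() returns early once stop_event is set, so shutdown is prompt.
        stop_event.wait(interval_seconds)
    return attempts
```

In a long-running service this loop would typically run in a background thread, for example threading.Thread(target=run_heartbeat_loop, args=(beat, 15, stop_event), daemon=True), where beat is the function that updates the timekeeping value and 15 is an arbitrary example interval.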
Comment 13 Richard Davis 2013-08-19 13:00:29 EDT
"ETL service sampling has encountered an error"

This error is also encountered if your language is not configured as "en_US.UTF8".
(For instance "en_GB.UTF-8")

Please see RHN support case 00898361 or speak to RH TSA Rui Gouveia for more details.

Thank You
Comment 14 wdaniel 2013-08-26 12:38:49 EDT
Team,

I recently attached a case to this; however, the customer is running the most up-to-date rhevm packages, and we verified the locale settings in the database (as per the KCS https://access.redhat.com/site/solutions/413393). Is there anything else I should check, or should I create a new bug? Thanks
Comment 15 Alon Bar-Lev 2013-08-26 12:45:02 EDT
(In reply to wdaniel from comment #14)
> Team,
> 
> I recently attached a case to this; however, the customer is running the
> most up-to-date rhevm packages, and we verified the locale settings in the
> database (as per the KCS https://access.redhat.com/site/solutions/413393).
> Is there anything else I should check, or should I create a new bug? Thanks

If you are unsure, open a new bug and attach logs; worst case, your bug will be marked as a duplicate of this or another bug.
Comment 16 Yaniv Lavi 2013-09-01 07:29:50 EDT
*** Bug 1001762 has been marked as a duplicate of this bug. ***
Comment 18 Richard Davis 2013-09-17 09:11:52 EDT
BZ 1005132 raised to cover Comment 13
Comment 21 Barak Dagan 2013-10-31 08:36:51 EDT
Verified on IS20.1

[RHEVM shell (connected)]# show host aqua08 | grep status
status-state                      : non_responsive

# tail -f /var/log/ovirt-engine/ovirt-engine-dwhd.log 
2013-10-31 10:05:56|ETL Service Started
Comment 22 Charlie 2013-11-27 19:21:44 EST
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.
Comment 24 errata-xmlrpc 2014-01-21 12:21:10 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html
