Bug 1076902 - RHEVM shows an event message of ETL service sampling has encountered an error
Summary: RHEVM shows an event message of ETL service sampling has encountered an error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-dwh
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.4.0
Assignee: Shirly Radco
QA Contact: Barak Dagan
URL:
Whiteboard: infra
Duplicates: 1073529 1077714
Depends On:
Blocks: rhev3.4snap1
 
Reported: 2014-03-16 11:26 UTC by Lev Veyde
Modified: 2019-06-13 08:01 UTC (History)

Fixed In Version: av4.1 - rhevm-dwh-3.4.0-6.el6ev.noarch.rpm
Doc Type: Bug Fix
Doc Text:
Previously, after rhevm-reports and rhevm-dwh were installed, the Red Hat Enterprise Virtualization Manager web interface periodically showed the following event message: "ETL service sampling has encountered an error. Please consult the service log for more details". Now, the type and name of the 'mem_shared' column have been changed, and the event message is no longer triggered.
Clone Of:
Environment:
Last Closed: 2014-06-09 15:18:39 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
DWH log (2.40 KB, application/x-tar-gz)
2014-03-16 11:26 UTC, Lev Veyde
dwh log (101.34 KB, text/x-log)
2014-03-25 15:34 UTC, Barak Dagan
dwh + serer log (136.21 KB, application/x-compressed-tar)
2014-03-27 12:37 UTC, Barak Dagan
dwh av5 log (520 bytes, application/x-compressed-tar)
2014-03-30 09:01 UTC, Barak Dagan


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:0601 0 normal SHIPPED_LIVE rhevm-dwh 3.4 bug fix and enhancement update 2014-06-09 19:15:53 UTC
oVirt gerrit 25787 0 None MERGED history: Changed mem_shared column type and name 2020-02-04 22:31:02 UTC
oVirt gerrit 25789 0 None MERGED history: updated ksm_shared_memory_percent to mb 2020-02-04 22:31:02 UTC
oVirt gerrit 25795 0 None MERGED history: Changed mem_shared column type and name 2020-02-04 22:31:02 UTC
oVirt gerrit 25812 0 None MERGED history: Changed mem_shared column type and name 2020-02-04 22:31:02 UTC
oVirt gerrit 26215 0 None MERGED etl: fixed generated code issue 2020-02-04 22:31:02 UTC

Description Lev Veyde 2014-03-16 11:26:13 UTC
Created attachment 875104 [details]
DWH log

Description of problem:
After installing rhevm-reports and rhevm-dwh, RHEVM periodically shows the following event message:
ETL service sampling has encountered an error. Please consult the service log for more details.

ovirt-engine-dwhd.log contains messages about NullPointerException.

Version-Release number of selected component (if applicable):

rhevm-dwh-3.4.0-3.el6ev.noarch
rhevm-reports-3.4.0-2.el6ev.noarch
rhevm-3.4.0-0.5.master.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install RHEV 3.4
2. Set up RHEV
3. Install Reports for RHEV 3.4
4. Set up Reports
5. Log in to RHEVM

Actual results:
Error messages appear.

Expected results:
System must function without errors.

Additional info:

Comment 1 Shirly Radco 2014-03-17 09:25:58 UTC
The mem_shared column in the engine's vds_statistics table holds a value
in MB and should be of type int.

However, in the view it is cast to smallint and named "ksm_shared_memory_percent".
It should be named "ksm_shared_memory_mb".
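
To illustrate the type mismatch described in this comment, here is a minimal Python sketch. It is not taken from the DWH code and relies only on PostgreSQL's documented integer ranges. The helper name `fits` and the sample readings are hypothetical, and the report does not state whether an out-of-range value was the actual trigger of the NullPointerException.

```python
# PostgreSQL integer ranges (2-byte smallint, 4-byte integer, 8-byte bigint).
RANGES = {
    "smallint": (-2**15, 2**15 - 1),
    "integer": (-2**31, 2**31 - 1),
    "bigint": (-2**63, 2**63 - 1),
}


def fits(value, pg_type):
    """Return True if value can be stored in the given PostgreSQL integer type."""
    low, high = RANGES[pg_type]
    return low <= value <= high


# Hypothetical mem_shared readings in MB, not taken from the bug's logs.
for mem_shared_mb in (512, 32767, 40000):
    print(mem_shared_mb, fits(mem_shared_mb, "smallint"), fits(mem_shared_mb, "integer"))
```

A percentage always fits in a smallint, but a value in MB does not have to, which is why a column holding MB needs a wider type and a name that reflects its unit.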

Comment 2 Yaniv Lavi 2014-03-19 19:48:36 UTC
*** Bug 1073529 has been marked as a duplicate of this bug. ***

Comment 3 Yaniv Lavi 2014-03-19 19:49:04 UTC
*** Bug 1077714 has been marked as a duplicate of this bug. ***

Comment 4 Barak Dagan 2014-03-20 08:04:17 UTC
I see a similar issue in 3.3.2. Should this bug be cloned, or should a new one be opened?

select * from information_schema.columns where table_name='vds_statistics' and column_name = 'mem_shared'

 table_catalog | table_schema |   table_name   | column_name | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octet_length | numeric_pr
---------------+--------------+----------------+-------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-----------
 engine        | public       | vds_statistics | mem_shared  |               10 |                | YES         | bigint    |                          |                        |           
(1 row)

select * from information_schema.columns where table_name='vds' and column_name = 'mem_shared'

 table_catalog | table_schema | table_name | column_name | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octet_length | numeric_precis
---------------+--------------+------------+-------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+---------------
 engine        | public       | vds        | mem_shared  |               67 |                | YES         | bigint    |                          |                        |               
(1 row)


select * from information_schema.columns where table_name='vds' and column_name like '%ksm%'
 table_catalog | table_schema | table_name |   column_name   | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octet_length | numeric_pr
---------------+--------------+------------+-----------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-----------
 engine        | public       | vds        | ksm_cpu_percent |               70 |                | YES         | integer   |                          |                        |           
 engine        | public       | vds        | ksm_pages       |               71 |                | YES         | bigint    |                          |                        |           
 engine        | public       | vds        | ksm_state       |               72 |                | YES         | boolean   |                          |                        |           
(3 rows)

Comment 5 Barak 2014-03-20 09:07:15 UTC
Barak, the mem_shared field was added in 3.4.
I don't understand how you encountered it on 3.3.2.

Please recheck.

Comment 6 Barak Dagan 2014-03-20 10:05:39 UTC
# rpm -qa | egrep 'rhevm-3|dwh|reports'
rhevm-3.3.2-0.49.el6ev.noarch
rhevm-dwh-3.3.2-1.el6ev.noarch
jasperreports-server-pro-5.5.0-6.el6ev.noarch
rhevm-reports-3.3.2-3.el6ev.noarch


engine=# select column_name from information_schema.columns where table_name='vds' and (column_name like '%ksm%' or column_name like '%mem%');
        column_name         
----------------------------
 physical_mem_mb
 pending_vmem_size
 mem_commited
 max_vds_memory_over_commit
 reserved_mem
 usage_mem_percent
 mem_available
 mem_free
 mem_shared
 ksm_cpu_percent
 ksm_pages
 ksm_state
(12 rows)

Comment 7 Barak 2014-03-20 12:16:50 UTC
This is the engine db, not the history.

Comment 8 Shirly Radco 2014-03-20 12:20:51 UTC
Barak D, Is this field also in the dwh views in the engine db? The view is called "vds".

Comment 9 Barak Dagan 2014-03-20 12:30:56 UTC
Does that answer your question, Shirly?

engine=# select column_name, data_type from information_schema.columns where table_name='vds' and (column_name like '%ksm%' or column_name like '%mem%');
        column_name         | data_type 
----------------------------+-----------
 physical_mem_mb            | integer
 pending_vmem_size          | integer
 mem_commited               | integer
 max_vds_memory_over_commit | integer
 reserved_mem               | integer
 usage_mem_percent          | integer
 mem_available              | bigint
 mem_free                   | bigint
 mem_shared                 | bigint
 ksm_cpu_percent            | integer
 ksm_pages                  | bigint
 ksm_state                  | boolean
(12 rows)

Comment 10 Shirly Radco 2014-03-20 13:05:43 UTC
You are referring to the engine DB.

We had the problem in the engine on 3.4 because we added
cast(mem_shared as smallint) to the dwh view, whereas it is supposed to stay as bigint.

We are not supposed to have it on 3.3.

I don't see a problem here.

Comment 11 Yaniv Lavi 2014-03-20 17:18:52 UTC
*** Bug 1073529 has been marked as a duplicate of this bug. ***

Comment 14 Barak Dagan 2014-03-25 15:34:03 UTC
Created attachment 878527 [details]
dwh log

engine=> select * from dwh_history_timekeeping;
     var_name      | var_value |         var_datetime          
-------------------+-----------+-------------------------------
 heartBeat         |           | 2014-03-25 17:33:10.072+02
 lastOsinfoSync    |           | 2014-03-24 17:13:23.46+02
 lastErrorSent     |           | 2014-03-25 17:33:12.288+02
 timesFailed       | 23        | 
 lastSampling      |           | 2014-03-25 19:06:24.635+02
 lastSync          |           | 2014-03-25 19:05:24+02
 lastFullHostCheck |           | 2014-03-25 19:05:24+02
 lastOsinfoUpdate  |           | 2014-03-24 17:13:23.460384+02
(8 rows)

Comment 15 Shirly Radco 2014-03-25 16:28:45 UTC
Barak, we checked with Barak Dagan. This is a known issue in 3.3 and unrelated.
Please remove z-stream flag.

Comment 16 Barak Dagan 2014-03-26 11:28:24 UTC
verification failed av4:

# rpm -q rhevm-dwh
rhevm-dwh-3.4.0-3.el6ev.noarch

could not change directory to "/root"
 table_catalog | table_schema |      table_name       |     column_name      | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character
---------------+--------------+-----------------------+----------------------+------------------+----------------+-------------+-----------+--------------------------+----------
 engine        | public       | dwh_host_history_view | ksm_cpu_percent      |                5 |                | YES         | smallint  |                          |          
 engine        | public       | dwh_host_history_view | ksm_shared_memory_mb |               13 |                | YES         | bigint    |                          |          
(2 rows)


2014-03-26 12:41:34|JodAHz|p20RrR|LqrhGM|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704
2014-03-26 12:56:34|1Q9SjX|p20RrR|LqrhGM|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704
2014-03-26 13:11:34|rFdC1Z|p20RrR|LqrhGM|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704
2014-03-26 13:26:34|2W5cZn|p20RrR|LqrhGM|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704

Comment 17 Yaniv Lavi 2014-03-26 12:04:50 UTC
The error this bug refers to is:
"ETL service sampling has encountered an error"
The message you are failing this bug on is:
"oVirt Engine is not updating the statistics"

Please read the bug and understand the issue.
The message you are failing on is a warning, not an error, raised when ovirt-engine is not updating statistics for some reason. If you stop the engine for any reason, such as an upgrade or a service stop, it will appear in the log.


Moving back to ON_QA.



Yaniv
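
As a rough illustration of the warning-versus-error distinction made in this comment, the Python sketch below splits a pipe-delimited log line and checks its type field. The field names in `FIELDS` are assumptions inferred from the sample lines in this report, not a documented format, and `parse_line` and `is_error` are hypothetical helpers.

```python
# Field names are assumptions inferred from the sample lines in this report.
FIELDS = ("timestamp", "pid", "root_pid", "father_pid", "project", "job",
          "context", "priority", "type", "origin", "message", "code")


def parse_line(line):
    """Split one pipe-delimited log line into a dict, or return None."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != len(FIELDS):
        return None  # e.g. stack-trace lines or service start/stop lines
    return dict(zip(FIELDS, parts))


def is_error(record):
    """Treat 'Java Exception' entries as errors; 'tWarn' entries are warnings."""
    return record is not None and record["type"] == "Java Exception"


warn = ("2014-03-26 12:41:34|JodAHz|p20RrR|LqrhGM|OVIRT_ENGINE_DWH|"
        "SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, "
        "oVirt Engine is not updating the statistics. Please check your "
        "oVirt Engine status.|9704")
print(is_error(parse_line(warn)))
```

Under these assumptions, the "oVirt Engine is not updating the statistics" lines are classified as warnings and would not count against this bug.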

Comment 18 Barak Dagan 2014-03-27 12:37:07 UTC
Created attachment 879445 [details]
dwh + serer log

Verified on AV4:

rhevm-reports-3.4.0-2.el6ev.noarch
rhevm-dwh-3.4.0-3.el6ev.noarch
jasperreports-server-pro-5.5.0-8.el6ev.noarch
rhevm-3.4.0-0.10.beta2.el6ev.noarch


psql -d engine -c "select column_name, data_type from information_schema.columns where table_name='dwh_host_history_view' and column_name like '%ksm%' ;" | less -S

     column_name      | data_type 
----------------------+-----------
 ksm_cpu_percent      | smallint
 ksm_shared_memory_mb | bigint
(2 rows)

psql -d engine_history -c "select column_name, data_type from information_schema.columns where table_name='v3_4_statistics_hosts_resources_usage_samples' and column_name like '%ksm%' ;" | less -S


could not change directory to "/root"
        column_name        | data_type 
---------------------------+-----------
 ksm_shared_memory_percent | smallint
 ksm_cpu_percent           | smallint
(2 rows)


psql -d engine_history -c "select column_name, data_type from information_schema.columns where table_name='v3_4_statistics_hosts_resources_usage_hourly' and column_name like '%ksm%' ;" | less -S


          column_name          | data_type 
-------------------------------+-----------
 ksm_shared_memory_percent     | smallint
 max_ksm_shared_memory_percent | smallint
 ksm_cpu_percent               | smallint
 max_ksm_cpu_percent           | smallint
(4 rows)


It seems that http://gerrit.ovirt.org/25789 is not in.
Is it fixed?
Could it explain comment 16?
2014-03-27 12:54:34|hAhLyS|iQRp3f|BzPXUZ|OVIRT_ENGINE_DWH|SampleRunJobs|Default|6|Java Exception|tRunJob_5|java.lang.RuntimeException:Child job running failed|1
Exception in component tRunJob_1
java.lang.RuntimeException: Child job running failed
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob.tRunJob_1Process(Unknown Source)
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob.tJDBCInput_2Process(Unknown Source)
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob.tJDBCConnection_1Process(Unknown Source)
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob.tJDBCConnection_2Process(Unknown Source)
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob.tRowGenerator_2Process(Unknown Source)
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob.tJDBCInput_3Process(Unknown Source)
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob.tJDBCInput_5Process(Unknown Source)
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob.tJDBCInput_4Process(Unknown Source)
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob.tJDBCConnection_3Process(Unknown Source)
	at ovirt_engine_dwh.sampletimekeepingjob_3_4.SampleTimeKeepingJob$2.run(Unknown Source)
2014-03-27 13:09:34|BzPXUZ|iQRp3f|C1EYd9|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|6|Java Exception|tRunJob_1|java.lang.RuntimeException:Child job running failed|1
2014-03-27 13:24:34|Qttsg9|iQRp3f|C1EYd9|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704


Log attached.

Comment 19 Barak Dagan 2014-03-27 15:01:38 UTC
ON av5:

2014-03-19 14:25:00|w35Z39|NTMNnU|U42Rbh|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn|tWarn_1|Can not sample data, oVirt Engine is not updating the statistics. Please check your oVirt Engine status.|9704
2014-03-27 14:59:20|ETL Service Stopped
2014-03-27 15:22:22|ETL Service Started
Warning:the operation 'max' for the output column 'max_ksm_shared_memory_mb' can't be processed because of incompatible input and/or output types

Is this a new bug blocking the current one, or should this one be reassigned?

Comment 20 Yaniv Lavi 2014-03-30 05:42:22 UTC
(In reply to Barak Dagan from comment #19)
> ON av5:
> 
> 2014-03-19
> 14:25:
> 00|w35Z39|NTMNnU|U42Rbh|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn
> |tWarn_1|Can not sample data, oVirt Engine is not updating the statistics.
> Please check your oVirt Engine status.|9704
> 2014-03-27 14:59:20|ETL Service Stopped
> 2014-03-27 15:22:22|ETL Service Started
> Warning:the operation 'max' for the output column 'max_ksm_shared_memory_mb'
> can't be processed because of incompatible input and/or output types
> 
> Is it a new Bug blocking the current one, or should this one be re-assign ?

Please attach full log.

Comment 21 Barak Dagan 2014-03-30 09:01:44 UTC
Created attachment 880266 [details]
dwh av5 log

Comment 22 Barak Dagan 2014-03-30 09:03:11 UTC
(In reply to Shirly Radco from comment #15)
> Barak, we checked with Barak Dagan. This is a known issue in 3.3 and
> unrelated.
> Please remove z-stream flag.

Comment 24 Yaniv Lavi 2014-03-30 11:55:03 UTC
(In reply to Barak Dagan from comment #19)
> ON av5:
> 
> 2014-03-19
> 14:25:
> 00|w35Z39|NTMNnU|U42Rbh|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn
> |tWarn_1|Can not sample data, oVirt Engine is not updating the statistics.
> Please check your oVirt Engine status.|9704
> 2014-03-27 14:59:20|ETL Service Stopped
> 2014-03-27 15:22:22|ETL Service Started
> Warning:the operation 'max' for the output column 'max_ksm_shared_memory_mb'
> can't be processed because of incompatible input and/or output types
> 
> Is it a new Bug blocking the current one, or should this one be re-assign ?

This bug is puzzling me. Can you try a fresh install of the engine and let me know if this still happens? The definition in the project looks good.


Yaniv

Comment 25 Yaniv Lavi 2014-03-30 12:04:22 UTC
(In reply to Yaniv Dary from comment #24)
> (In reply to Barak Dagan from comment #19)
> > ON av5:
> > 
> > 2014-03-19
> > 14:25:
> > 00|w35Z39|NTMNnU|U42Rbh|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|Default|5|tWarn
> > |tWarn_1|Can not sample data, oVirt Engine is not updating the statistics.
> > Please check your oVirt Engine status.|9704
> > 2014-03-27 14:59:20|ETL Service Stopped
> > 2014-03-27 15:22:22|ETL Service Started
> > Warning:the operation 'max' for the output column 'max_ksm_shared_memory_mb'
> > can't be processed because of incompatible input and/or output types
> > 
> > Is it a new Bug blocking the current one, or should this one be re-assign ?
> 
> This bug is puzzling me. Can you try a fresh install of engine and let me
> know if this still happens? the definition in the project looks good. 
> 
> 
> Yaniv

Never mind, I was able to recreate and fix it. Moving to MODIFIED.

Comment 27 Barak Dagan 2014-04-02 08:15:46 UTC
Verified on av4.1:

rhevm-3.4.0-0.10.beta2.el6ev.noarch

rhevm-dwh-3.4.0-6.el6ev.noarch
rhevm-dwh-setup-3.4.0-6.el6ev.noarch

rhevm-reports-3.4.0-2.el6ev.noarch
rhevm-reports-setup-3.4.0-2.el6ev.noarch

jasperreports-server-pro-5.5.0-9.el6ev.noarch


bash-4.1$ psql -d engine -c "select column_name, data_type from information_schema.columns where table_name ='dwh_host_history_view' and (column_name like '%ksm%' or column_name like '%mem%');
> "
could not change directory to "/root"
     column_name      | data_type 
----------------------+-----------
 memory_usage_percent | smallint
 ksm_cpu_percent      | smallint
 ksm_shared_memory_mb | bigint


bash-4.1$ psql -d ovirt_engine_history -c "select table_name, column_name, data_type from information_schema.columns where table_name like 'host%samples%' and (column_name like '%ksm%' or column_name like '%mem%');"
could not change directory to "/root"
      table_name      |     column_name      | data_type 
----------------------+----------------------+-----------
 host_samples_history | memory_usage_percent | smallint
 host_samples_history | ksm_cpu_percent      | smallint
 host_samples_history | ksm_shared_memory_mb | bigint


bash-4.1$ psql -d ovirt_engine_history -c "select table_name, column_name, data_type from information_schema.columns where table_name like 'host%hour%' and column_name like '%ksm%';"
could not change directory to "/root"
     table_name      |       column_name        | data_type 
---------------------+--------------------------+-----------
 host_hourly_history | ksm_cpu_percent          | smallint
 host_hourly_history | max_ksm_cpu_percent      | smallint
 host_hourly_history | ksm_shared_memory_mb     | bigint
 host_hourly_history | max_ksm_shared_memory_mb | bigint
(4 rows)


bash-4.1$ psql -d ovirt_engine_history -c "select table_name, column_name, data_type from information_schema.columns where table_name like 'host%daily%' and column_name like '%ksm%';"
could not change directory to "/root"
     table_name     |       column_name        | data_type 
--------------------+--------------------------+-----------
 host_daily_history | ksm_cpu_percent          | smallint
 host_daily_history | max_ksm_cpu_percent      | smallint
 host_daily_history | ksm_shared_memory_mb     | bigint
 host_daily_history | max_ksm_shared_memory_mb | bigint
(4 rows)


bash-4.1$ psql -d ovirt_engine_history -c "select date_trunc('hour', history_datetime), count(*) from host_samples_history group by 1;" | less -S
       date_trunc       | count 
------------------------+-------
 2014-04-01 14:00:00+03 |    55
 2014-04-01 15:00:00+03 |   120
 2014-04-01 16:00:00+03 |   120
 2014-04-01 17:00:00+03 |   120
...
 2014-04-02 07:00:00+03 |   120
 2014-04-02 08:00:00+03 |   120
 2014-04-02 09:00:00+03 |   120
 2014-04-02 10:00:00+03 |   100
(21 rows)



bash-4.1$ psql -d ovirt_engine_history -c "select date_trunc('hour', history_datetime), count(*) from host_hourly_history group by 1 order by 1;" | less -S
       date_trunc       | count 
------------------------+-------
 2014-04-01 14:00:00+03 |     2
 2014-04-01 15:00:00+03 |     2
 2014-04-01 16:00:00+03 |     2
 2014-04-01 17:00:00+03 |     2
...
 2014-04-02 05:00:00+03 |     2
 2014-04-02 06:00:00+03 |     2
 2014-04-02 07:00:00+03 |     2
 2014-04-02 08:00:00+03 |     2
(19 rows)



bash-4.1$ psql -d ovirt_engine_history -c "select date_trunc('hour', history_datetime), count(*) from host_daily_history group by 1 order by 1;" | less -S
 date_trunc | count 
------------+-------
(0 rows)   ----> will continue to monitor

Comment 28 Barak Dagan 2014-04-03 08:04:52 UTC
psql -d ovirt_engine_history -c "select date_trunc('hour', history_datetime), count(*) from host_daily_history group by 1 order by 1;" | less -S

       date_trunc       | count 
------------------------+-------
 2014-04-01 00:00:00+03 |     2
(1 row)
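
The per-hour checks used in the verification can be sketched outside the database as follows. This Python snippet mimics grouping by date_trunc('hour', ...) with count(*). The data is hypothetical: it assumes two hosts sampled once a minute, which would be consistent with the 120 rows per full hour shown above, but the report does not state the host count or the sampling interval.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_per_hour(timestamps):
    """Group datetimes by hour, like date_trunc('hour', ...) with count(*)."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted(buckets.items())


# Hypothetical data: two hosts, one sample per host per minute for two hours.
start = datetime(2014, 4, 1, 15, 0)
samples = [start + timedelta(minutes=m) for m in range(120) for _host in range(2)]
for hour, count in count_per_hour(samples):
    print(hour, count)
```

A full hour with fewer rows than its neighbours would point to a gap in sampling, while partial first and last hours are expected.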

Comment 29 errata-xmlrpc 2014-06-09 15:18:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0601.html

