1463165 – Records with duplicate timestamp in metrics rollup table

Bug 1463165 - Records with duplicate timestamp in metrics rollup table

Summary: Records with duplicate timestamp in metrics rollup table

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Providers
Sub Component:
Version:	5.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	GA
Target Release:	5.9.0
Assignee:	James Wong
QA Contact:	Nandini Chandra
Docs Contact:
URL:
Whiteboard:	c&u
Depends On:
Blocks:	1505415 1505417
TreeView+	depends on / blocked

Reported:	2017-06-20 09:39 UTC by Nandini Chandra
Modified:	2018-03-06 14:54 UTC (History)
CC List:	8 users (show)
Fixed In Version:	5.9.0.4
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1505415 1505417 (view as bug list)
Environment:
Last Closed:	2018-03-06 14:54:01 UTC
Category:	---
Cloudforms Team:	VMware
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
screen shot of hourly C&U graph with duplicate time stamps (62.70 KB, image/png) 2017-06-20 09:41 UTC, Nandini Chandra	no flags	Details
View All

Description Nandini Chandra 2017-06-20 09:39:36 UTC

Description of problem:
-----------------------
The metrics rollup table has a few records with duplicate time stamp, although only one record contains C&U data.

The table has 24 records, like expected, for the cu-24x7 VM on 06/15/17.
But, it has 27 records for the same VM on 06/14/17.The C&U graphs also reflect this and there are gaps in the graph and the bottom section also has records with duplicate time stamp (See attached screen shot).

vmdb_production=# select count(*) from metric_rollups where resource_name = 'cu-24x7' and capture_interval_name = 'hourly' and timestamp > '06/14/17' and timestamp <= '06/15/17' and parent_ems_id = 1\x\g\x
Expanded display is on.
-[ RECORD 1 ]
count | 24


vmdb_production=# select count(*) from metric_rollups where resource_name = 'cu-24x7' and capture_interval_name = 'hourly' and timestamp > '06/13/17' and timestamp <= '06/14/17' and parent_ems_id = 1\x\g\x
Expanded display is on.
-[ RECORD 1 ]
count | 27

vmdb_production=# select * from metric_rollups where resource_name = 'cu-24x7' and capture_interval_name = 'hourly' and timestamp = '2017-06-13 05:00:00' and parent_ems_id = 1\x\g\x
Expanded display is on.
-[ RECORD 1 ]--------------------------+-------------------------------------------------------------------------------------------
id                                     | 8405
timestamp                              | 2017-06-13 05:00:00
capture_interval                       | 3600
resource_type                          | VmOrTemplate
resource_id                            | 12
cpu_usage_rate_average                 | 0.790611111111111
cpu_usagemhz_rate_average              | 37.7333333333333
mem_usage_absolute_average             | 0.591055555555555
disk_usage_rate_average                | 20.9944444444444
net_usage_rate_average                 | 0
sys_uptime_absolute_latest             | 1590893

-[ RECORD 2 ]--------------------------+-------------------------------------------------------------------------------------------
id                                     | 8267
timestamp                              | 2017-06-13 05:00:00
capture_interval                       | 3600
resource_type                          | VmOrTemplate
resource_id                            | 12
cpu_usage_rate_average                 | 
cpu_usagemhz_rate_average              | 
mem_usage_absolute_average             | 
disk_usage_rate_average                | 
net_usage_rate_average                 | 



Version-Release number of selected component (if applicable):
-------------------------------------------------------------
5.7.3.1


How reproducible:
-----------------
Not sure


Steps to Reproduce:
-------------------
1.Manage a provider.Enable C&U collection for the provider.
2.Check the DB.


Actual results:
---------------
Records with duplicate time stamp in metrics rollup table


Expected results:
-----------------
There should be no records with duplicate time stamp in the metrics rollup table.


Additional info:
----------------

Comment 2 Nandini Chandra 2017-06-20 09:41:56 UTC

Created attachment 1289542 [details]
screen shot of hourly C&U graph with duplicate time stamps

Comment 3 Bronagh Sorota 2017-06-20 13:14:35 UTC

Hi Nandini,
Which provider is this for?

Thanks
Bronagh

Comment 4 Nandini Chandra 2017-06-20 18:02:57 UTC

I have seen this issue with vSphere.

I'll update the BZ with test results from other providers.

Comment 5 Bronagh Sorota 2017-06-23 13:27:16 UTC

That would be great Nandini, that will tell me if this is VMware specific issue or a general back end issue and will allow me to assign this correctly.

Bronagh

Comment 6 Nandini Chandra 2017-07-03 07:41:27 UTC

Bronagh,

This issue is not seen with Azure, EC2, GCE. So,this could be a VMware specific issue.

Comment 7 Bronagh Sorota 2017-07-03 13:10:10 UTC

Thanks for the investigation, Nandini.

Comment 8 Tasos Papaioannou 2017-08-17 20:25:02 UTC

FYI, I see this same issue with RHEV guests. For example,

****
vmdb_production=# select id, name from ext_management_systems order by id;
 id |   name    
----+-----------
  1 | rhv41
  2 | vsphere65
  3 | rhevm36
(3 rows)

vmdb_production=# select timestamp, created_on, resource_name, resource_id from metric_rollups where resource_type='VmOrTemplate' and parent_ems_id=1 order by timestamp desc, resource_name;
      timestamp      |         created_on         |       resource_name       | resource_id 
---------------------+----------------------------+---------------------------+-------------
[...]
 2017-08-17 11:00:00 | 2017-08-17 11:01:02.735282 | cfme_57_vm_0              |           1
 2017-08-17 11:00:00 | 2017-08-17 11:04:14.728524 | cfme_57_vm_0              |           1
 2017-08-17 11:00:00 | 2017-08-17 11:01:02.81147  | cfme_58_vm_0              |           2
 2017-08-17 11:00:00 | 2017-08-17 11:04:14.774167 | cfme_58_vm_0              |           2
 2017-08-17 11:00:00 | 2017-08-17 11:04:14.752306 | golden_env_mixed_virtio_0 |           3
 2017-08-17 11:00:00 | 2017-08-17 11:01:02.775244 | golden_env_mixed_virtio_0 |           3
 2017-08-17 11:00:00 | 2017-08-17 11:01:02.847302 | golden_env_mixed_virtio_1 |           4
 2017-08-17 11:00:00 | 2017-08-17 11:04:14.795609 | golden_env_mixed_virtio_1 |           4
[...]
****

In each case, the duplicate rollup records were created within ~2.5-3 minutes of each other.

Comment 9 Tasos Papaioannou 2017-08-30 19:22:34 UTC

Some add'l info:

In my testing, the duplicate rollup records are created when Storage.perf_capture_hourly silently fails to commit an update to the last_perf_capture_on column for a storage record in the "storages" table. The next time the perf_capture_timer runs, it checks the last_perf_capture_on column, and determines that it needs to schedule a capture again, which results in the duplicate rollup records for the associated VMs.

Here's an example:

vmdb_production=# select id, timestamp, created_on, length(min_max)>0 as is_complete from metric_rollups where resource_name='cu-24x7' and capture_interval_name='hourly' order by timestamp desc;

  id   |      timestamp      |         created_on         | is_complete 
-------+---------------------+----------------------------+-------------
 39656 | 2017-08-29 20:00:00 | 2017-08-29 20:04:34.073343 | t
 39621 | 2017-08-29 20:00:00 | 2017-08-29 20:01:51.944124 | 

With some extra logging added to Metric::Capture.filter_perf_capture_now and Storage#perf_capture, we see the following:

1.) 1st run sees that the storage record has last_perf_capture_on set to 2017-08-29 19:00:00 UTC, which is earlier than the threshold, so it's eligible for capture:

[----] I, [2017-08-29T16:01:14.423783 #33143:625138]  INFO -- : MIQ(Metric::Capture.filter_perf_capture_now) tpapaioa capture target rhevm36_iscsi_data last_perf_capture_on = 2017-08-29 19:00:00 UTC standard_capture_threshold = 2017-08-29 19:01:14 UTC

2.) After processing the associated VMs to perform rollups for, Storage#perf_capture calls update_attribute(:last_perf_capture_on, hour) to update the storage record's last_perf_capture_on column. update_attribute returns true.

[----] I, [2017-08-29T16:01:52.184822 #33186:625138]  INFO -- : MIQ(Storage#perf_capture) tpapaioa updating Storage name: [rhevm36_iscsi_data], id: [16] last_perf_capture_on to: 2017-08-29T20:00:00Z
[----] I, [2017-08-29T16:01:52.188526 #33186:625138]  INFO -- : MIQ(Storage#perf_capture) tpapaioa update_attribute returned true

3.) The next time the timer runs, it sees the old value for last_perf_capture_on:

[----] I, [2017-08-29T16:04:12.111692 #33143:625138]  INFO -- : MIQ(Metric::Capture.filter_perf_capture_now) tpapaioa capture target rhevm36_iscsi_data last_perf_capture_on 2017-08-29 19:00:00 UTC standard_capture_threshold = 2017-08-29 19:04:12 UTC 

so it ends up creating a duplicate rollup record for the associated VMs.

I don't see any errors in the postgres logs when this happens.

Comment 10 Tasos Papaioannou 2017-09-05 17:02:41 UTC

The issue appears to be in the targets returned by Metric::Targets.capture_targets, which are preloaded and sometimes the Storage instances still have the old value of last_perf_capture_on.

In /var/www/miq/vmdb/app/models/metric/capture.rb:

1.) perf_capture_timer calls Metric::Targets.capture_targets to get the storage, host, and vm target candidates. Then it calls filter_perf_capture_now to filter on the targets that are actually due for perf capture, based on last_perf_capture_on:

  def self.perf_capture_timer(zone = nil)
    _log.info "Queueing performance capture..."

    zone ||= MiqServer.my_server.zone

    perf_capture_health_check(zone)
    targets = Metric::Targets.capture_targets(zone)
[...]
    targets = filter_perf_capture_now(targets, target_options)
[...]

  def self.filter_perf_capture_now(targets, target_options)
    targets.select do |target|
      options = target_options[target]
      # [:force] is set if we already determined this target needs perf capture
      if options[:force] || perf_capture_now?(target)
        true
      else
        _log.debug do
          "Skipping capture of #{target.log_target} -" +
            "Performance last captured on [#{target.last_perf_capture_on}] is within threshold"
        end
        false
      end
    end
  end

If I add a target.reload in filter_perf_capture_now before it calls perf_capture_now?(target), then the duplicate rollup records are no longer created.

Comment 11 James Wong 2017-10-13 15:16:16 UTC

PR https://github.com/ManageIQ/manageiq/pull/16166 merged.

Comment 14 Nandini Chandra 2017-11-07 05:12:41 UTC

Verified in 5.9.0.5

Note You need to log in before you can comment on or make changes to this bug.