Description of problem:
-----------------------
The metric_rollups table has a few records with duplicate timestamps, although only one record contains C&U data. The table has 24 records, as expected, for the cu-24x7 VM on 06/15/17, but it has 27 records for the same VM on 06/14/17. The C&U graphs also reflect this: there are gaps in the graph, and the bottom section also shows records with duplicate timestamps (see attached screenshot).

vmdb_production=# select count(*) from metric_rollups where resource_name = 'cu-24x7' and capture_interval_name = 'hourly' and timestamp > '06/14/17' and timestamp <= '06/15/17' and parent_ems_id = 1\x\g\x
Expanded display is on.
-[ RECORD 1 ]
count | 24

vmdb_production=# select count(*) from metric_rollups where resource_name = 'cu-24x7' and capture_interval_name = 'hourly' and timestamp > '06/13/17' and timestamp <= '06/14/17' and parent_ems_id = 1\x\g\x
Expanded display is on.
-[ RECORD 1 ]
count | 27

vmdb_production=# select * from metric_rollups where resource_name = 'cu-24x7' and capture_interval_name = 'hourly' and timestamp = '2017-06-13 05:00:00' and parent_ems_id = 1\x\g\x
Expanded display is on.
-[ RECORD 1 ]--------------------------+-------------------------------------------------------------------------------------------
id                         | 8405
timestamp                  | 2017-06-13 05:00:00
capture_interval           | 3600
resource_type              | VmOrTemplate
resource_id                | 12
cpu_usage_rate_average     | 0.790611111111111
cpu_usagemhz_rate_average  | 37.7333333333333
mem_usage_absolute_average | 0.591055555555555
disk_usage_rate_average    | 20.9944444444444
net_usage_rate_average     | 0
sys_uptime_absolute_latest | 1590893
-[ RECORD 2 ]--------------------------+-------------------------------------------------------------------------------------------
id                         | 8267
timestamp                  | 2017-06-13 05:00:00
capture_interval           | 3600
resource_type              | VmOrTemplate
resource_id                | 12
cpu_usage_rate_average     |
cpu_usagemhz_rate_average  |
mem_usage_absolute_average |
disk_usage_rate_average    |
net_usage_rate_average     |

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
5.7.3.1

How reproducible:
-----------------
Not sure

Steps to Reproduce:
-------------------
1. Manage a provider. Enable C&U collection for the provider.
2. Check the DB.

Actual results:
---------------
Records with duplicate timestamps in the metric_rollups table.

Expected results:
-----------------
There should be no records with duplicate timestamps in the metric_rollups table.

Additional info:
----------------
Created attachment 1289542 [details]
Screenshot of hourly C&U graph with duplicate timestamps
Hi Nandini,

Which provider is this for?

Thanks,
Bronagh
I have seen this issue with vSphere. I'll update the BZ with test results from other providers.
That would be great, Nandini. That will tell me whether this is a VMware-specific issue or a general back-end issue, and will allow me to assign this correctly.

Bronagh
Bronagh, this issue is not seen with Azure, EC2, or GCE, so this could be a VMware-specific issue.
Thanks for the investigation, Nandini.
FYI, I see this same issue with RHEV guests. For example:

****
vmdb_production=# select id, name from ext_management_systems order by id;
 id |   name
----+-----------
  1 | rhv41
  2 | vsphere65
  3 | rhevm36
(3 rows)

vmdb_production=# select timestamp, created_on, resource_name, resource_id from metric_rollups where resource_type='VmOrTemplate' and parent_ems_id=1 order by timestamp desc, resource_name;
      timestamp      |         created_on         |       resource_name       | resource_id
---------------------+----------------------------+---------------------------+-------------
[...]
 2017-08-17 11:00:00 | 2017-08-17 11:01:02.735282 | cfme_57_vm_0              |           1
 2017-08-17 11:00:00 | 2017-08-17 11:04:14.728524 | cfme_57_vm_0              |           1
 2017-08-17 11:00:00 | 2017-08-17 11:01:02.81147  | cfme_58_vm_0              |           2
 2017-08-17 11:00:00 | 2017-08-17 11:04:14.774167 | cfme_58_vm_0              |           2
 2017-08-17 11:00:00 | 2017-08-17 11:04:14.752306 | golden_env_mixed_virtio_0 |           3
 2017-08-17 11:00:00 | 2017-08-17 11:01:02.775244 | golden_env_mixed_virtio_0 |           3
 2017-08-17 11:00:00 | 2017-08-17 11:01:02.847302 | golden_env_mixed_virtio_1 |           4
 2017-08-17 11:00:00 | 2017-08-17 11:04:14.795609 | golden_env_mixed_virtio_1 |           4
[...]
****

In each case, the duplicate rollup records were created within ~2.5-3 minutes of each other.
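A quick way to confirm the pattern outside of psql is to group rollup rows by (resource_id, timestamp) and flag any group with more than one record. The sketch below is illustrative only: plain Ruby over hashes standing in for metric_rollups rows, not the actual ActiveRecord models.

```ruby
# Minimal sketch: find duplicate (resource_id, timestamp) pairs in a set
# of rollup rows. The row hashes stand in for metric_rollups records.
rows = [
  { id: 1, resource_id: 1, timestamp: "2017-08-17 11:00:00" },
  { id: 2, resource_id: 1, timestamp: "2017-08-17 11:00:00" }, # duplicate
  { id: 3, resource_id: 2, timestamp: "2017-08-17 11:00:00" },
]

duplicates = rows
  .group_by { |r| [r[:resource_id], r[:timestamp]] }
  .select { |_key, group| group.size > 1 }

duplicates.each do |(resource_id, ts), group|
  puts "resource #{resource_id} has #{group.size} rollups for #{ts}"
end
```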
Some additional info: in my testing, the duplicate rollup records are created when Storage.perf_capture_hourly silently fails to commit an update to the last_perf_capture_on column for a storage record in the "storages" table. The next time perf_capture_timer runs, it checks the last_perf_capture_on column and determines that it needs to schedule a capture again, which results in duplicate rollup records for the associated VMs.

Here's an example:

vmdb_production=# select id, timestamp, created_on, length(min_max)>0 as is_complete from metric_rollups where resource_name='cu-24x7' and capture_interval_name='hourly' order by timestamp desc;
  id   |      timestamp      |         created_on         | is_complete
-------+---------------------+----------------------------+-------------
 39656 | 2017-08-29 20:00:00 | 2017-08-29 20:04:34.073343 | t
 39621 | 2017-08-29 20:00:00 | 2017-08-29 20:01:51.944124 |

With some extra logging added to Metric::Capture.filter_perf_capture_now and Storage#perf_capture, we see the following:

1.) The first run sees that the storage record has last_perf_capture_on set to 2017-08-29 19:00:00 UTC, which is earlier than the threshold, so it's eligible for capture:

[----] I, [2017-08-29T16:01:14.423783 #33143:625138]  INFO -- : MIQ(Metric::Capture.filter_perf_capture_now) tpapaioa capture target rhevm36_iscsi_data last_perf_capture_on = 2017-08-29 19:00:00 UTC standard_capture_threshold = 2017-08-29 19:01:14 UTC

2.) After processing the associated VMs to perform rollups for, Storage#perf_capture calls update_attribute(:last_perf_capture_on, hour) to update the storage record's last_perf_capture_on column. update_attribute returns true:

[----] I, [2017-08-29T16:01:52.184822 #33186:625138]  INFO -- : MIQ(Storage#perf_capture) tpapaioa updating Storage name: [rhevm36_iscsi_data], id: [16] last_perf_capture_on to: 2017-08-29T20:00:00Z
[----] I, [2017-08-29T16:01:52.188526 #33186:625138]  INFO -- : MIQ(Storage#perf_capture) tpapaioa update_attribute returned true

3.) The next time the timer runs, it sees the old value for last_perf_capture_on:

[----] I, [2017-08-29T16:04:12.111692 #33143:625138]  INFO -- : MIQ(Metric::Capture.filter_perf_capture_now) tpapaioa capture target rhevm36_iscsi_data last_perf_capture_on 2017-08-29 19:00:00 UTC standard_capture_threshold = 2017-08-29 19:04:12 UTC

so it ends up creating a duplicate rollup record for the associated VMs. I don't see any errors in the postgres logs when this happens.
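The eligibility check visible in the log lines above boils down to comparing last_perf_capture_on against a capture threshold (roughly "now" minus a standard capture interval). The following is a hedged re-creation of that comparison; the method name capture_due? and the 60-second threshold are invented for illustration (the real check lives in Metric::Capture.perf_capture_now?):

```ruby
require "time"

# Illustrative threshold check: a target is due for capture when its
# last_perf_capture_on is nil or older than (now - threshold_seconds).
def capture_due?(last_perf_capture_on, now, threshold_seconds)
  threshold = now - threshold_seconds
  last_perf_capture_on.nil? || last_perf_capture_on < threshold
end

now       = Time.parse("2017-08-29 19:01:14 UTC")
last_seen = Time.parse("2017-08-29 19:00:00 UTC")

# With the stale timestamp from the logs above, the storage looks
# eligible for capture again, even though it was just captured.
puts capture_due?(last_seen, now, 60)  # => true
```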
The issue appears to be in the targets returned by Metric::Targets.capture_targets, which are preloaded, and sometimes the Storage instances still have the old value of last_perf_capture_on.

In /var/www/miq/vmdb/app/models/metric/capture.rb:

1.) perf_capture_timer calls Metric::Targets.capture_targets to get the storage, host, and vm target candidates. Then it calls filter_perf_capture_now to filter down to the targets that are actually due for perf capture, based on last_perf_capture_on:

  def self.perf_capture_timer(zone = nil)
    _log.info "Queueing performance capture..."

    zone ||= MiqServer.my_server.zone
    perf_capture_health_check(zone)
    targets = Metric::Targets.capture_targets(zone)
    [...]
    targets = filter_perf_capture_now(targets, target_options)
    [...]

  def self.filter_perf_capture_now(targets, target_options)
    targets.select do |target|
      options = target_options[target]
      # [:force] is set if we already determined this target needs perf capture
      if options[:force] || perf_capture_now?(target)
        true
      else
        _log.debug do
          "Skipping capture of #{target.log_target} -" +
          "Performance last captured on [#{target.last_perf_capture_on}] is within threshold"
        end
        false
      end
    end
  end

If I add a target.reload in filter_perf_capture_now before it calls perf_capture_now?(target), then the duplicate rollup records are no longer created.
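The effect of the target.reload fix can be simulated without Rails: a preloaded object caches its attributes, so a concurrent update is invisible until the object is re-read from the store. The classes below are simplified stand-ins for ActiveRecord behavior, not ManageIQ code:

```ruby
# Stand-in "database": id => attributes hash.
DB = { 16 => { last_perf_capture_on: Time.utc(2017, 8, 29, 19, 0, 0) } }

# Minimal stand-in for a preloaded ActiveRecord object.
class StorageTarget
  attr_reader :id, :last_perf_capture_on

  def initialize(id)
    @id = id
    reload
  end

  # Re-read attributes from the store, like ActiveRecord's #reload.
  def reload
    @last_perf_capture_on = DB[@id][:last_perf_capture_on]
    self
  end
end

target = StorageTarget.new(16)

# Another worker commits a newer capture time after our target was preloaded.
DB[16][:last_perf_capture_on] = Time.utc(2017, 8, 29, 20, 0, 0)

stale = target.last_perf_capture_on        # still 19:00, so the target would be re-queued
fresh = target.reload.last_perf_capture_on # 20:00, so the target is correctly skipped

puts stale  # 2017-08-29 19:00:00 UTC
puts fresh  # 2017-08-29 20:00:00 UTC
```

This mirrors why adding target.reload before perf_capture_now?(target) prevents the duplicate rollups: the filter then sees the committed value rather than the preloaded one.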
PR https://github.com/ManageIQ/manageiq/pull/16166 merged.
Verified in 5.9.0.5