Description of problem:

At the completion of an EMS refresh, the worker prints some timings to evm.log showing the time taken for each stage of the refresh operation. For example:

Refreshing targets for EMS...Complete - Timings {:server_dequeue=>0.009632349014282227, :get_ems_data=>112.41256022453308, :get_vc_data=>888.4312617778778, :filter_vc_data=>0.4041626453399658, :get_vc_data_host_scsi=>494.66731095314026, :collect_inventory_for_targets=>1499.7596287727356, :parse_vc_data=>7.983795404434204, :parse_targeted_inventory=>8.024113893508911, :db_save_inventory=>731.6735608577728, :save_inventory=>731.6747822761536, :ems_refresh=>2239.5073816776276}

These timings are incorrect. Rather than showing the timing of the individual refresh operation, the timer values appear not to be zeroed prior to each refresh. As a result, each timer value grows incrementally: it becomes the sum of the duration of the most recent operation and the previous timer value. The timers are only reset when a worker is restarted.

The problem can be visualised more easily by running the attached Ruby script:

ruby ems_refresh_timings.rb -f evm.log

It prints output similar to the following, where it can be seen that each value is the sum of the current time and the previous value:

Worker PID: 8515
Refresh start time: 2017-01-30T03:17:57.168530
Refresh end time:   2017-01-30T03:18:16.228082
Provider:           Vmware::InfraManager
EMS Name:           CLOUD-FRANCE-VCENTER
Refresh type:       targeted
Refresh targets:    {"Vm"=>"frpveeam01.solar.priv"}
Refresh timings:
  server_dequeue:                 0.00289
  get_ems_data:                   0.03904
  get_vc_data:                    8.90687
  filter_vc_data:                 0.01239
  get_vc_data_host_scsi:          1.49227
  collect_inventory_for_targets: 10.45161
  parse_vc_data:                  0.04002
  parse_targeted_inventory:       0.04020
  db_save_inventory:              8.56718
  save_inventory:                 8.56720
  ems_refresh:                   19.05932

Worker PID: 8515
Refresh start time: 2017-01-30T03:20:47.844694
Refresh end time:   2017-01-30T03:21:05.605325
Provider:           Vmware::InfraManager
EMS Name:           CLOUD-FRANCE-VCENTER
Refresh type:       targeted
Refresh targets:    {"Vm"=>"frpveeam01.solar.priv"}
Refresh timings:
  server_dequeue:                 0.00289
  get_ems_data:                   0.08007
  get_vc_data:                   23.44743
  filter_vc_data:                 0.01333
  get_vc_data_host_scsi:          2.67952
  collect_inventory_for_targets: 26.22239
  parse_vc_data:                  0.05610
  parse_targeted_inventory:       0.05800
  db_save_inventory:             10.53875
  save_inventory:                10.53877
  ems_refresh:                   36.81976

Worker PID: 8515
Refresh start time: 2017-01-30T03:21:28.736595
Refresh end time:   2017-01-30T03:21:48.356252
Provider:           Vmware::InfraManager
EMS Name:           CLOUD-FRANCE-VCENTER
Refresh type:       targeted
Refresh targets:    {"Vm"=>"frpipam01.solar.priv"}
Refresh timings:
  server_dequeue:                 0.00289
  get_ems_data:                   0.15466
  get_vc_data:                   36.64317
  filter_vc_data:                 0.01537
  get_vc_data_host_scsi:          4.23346
  collect_inventory_for_targets: 41.04998
  parse_vc_data:                  0.09418
  parse_targeted_inventory:       0.10141
  db_save_inventory:             15.28555
  save_inventory:                15.28559
  ems_refresh:                   56.43840

Version-Release number of selected component (if applicable):
5.7.0.17.20161219135818_725f92f

How reproducible:
Every time

Steps to Reproduce:
1. Find an appliance that has run several EMS refreshes.
2. Run the attached script to display the benchmark timings.

Actual results:
The timings are calculated incorrectly; each timer accumulates across refreshes.

Expected results:
The timers should be reset prior to each EMS refresh.

Additional info:
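As an illustration of the accumulation, a rough sketch of how per-refresh durations could be recovered by differencing successive cumulative Timings values for the same worker PID. This is a hypothetical helper, not the attached ems_refresh_timings.rb, and the log-line regex is an assumption based on the evm.log prefix format shown elsewhere in this bug:

```ruby
#!/usr/bin/env ruby
# Hypothetical helper (not the attached script): subtract the previous cumulative
# value of each timer, per worker PID, to recover approximate per-refresh durations.
# Usage: ruby diff_refresh_timings.rb evm.log

previous = Hash.new { |h, pid| h[pid] = Hash.new(0.0) }

ARGF.each_line do |line|
  # Assumed format: "[----] I, [... #<pid>:<tid>] ... Refreshing targets for EMS...Complete - Timings {...}"
  next unless line =~ /#(\d+):\w+\].*Refreshing targets for EMS.*Complete - Timings/
  pid = Regexp.last_match(1)
  line.scan(/:(\w+)=>([\d.]+)/).each do |name, value|
    delta = value.to_f - previous[pid][name]   # duration of this refresh only
    printf("PID %-7s %-32s %12.5f\n", pid, name, delta)
    previous[pid][name] = value.to_f
  end
  puts
end
```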
Created attachment 1255286 [details]
Ruby script to demonstrate the problem
This problem does not seem to be confined to Benchmark.realtime_block timings for EMS refreshes. It also affects Benchmark.realtime_block timings for metrics collection, and possibly others.
There is a GitHub issue describing this exact problem: https://github.com/ManageIQ/manageiq/issues/9642

It appears to be caused by Timeout skipping the ensure block of the benchmark, so the timings are not reset for the next block.
From the comment in the code:

# If the hash already has a value for that key, the time is accumulated.

This is exactly what's happening, but it looks like we're getting a timeout on every call. It seems to affect all realtime counters that are printed to evm.log from any worker.
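To make the quoted comment concrete, here is a minimal sketch of the accumulate-on-repeat behaviour it describes; this is hypothetical code, not the actual miq-benchmark.rb implementation. The accumulation itself appears intentional (it lets nested blocks roll up); the question discussed below is why the hash isn't reset between operations.

```ruby
# Hypothetical sketch of the documented behaviour, not the MIQ implementation.
TIMINGS = Hash.new(0.0)

def realtime_store_sketch(key)
  start  = Time.now
  result = yield
  TIMINGS[key] += Time.now - start   # an existing key accumulates rather than resets
  result
end

realtime_store_sketch(:ems_refresh) { sleep 0.2 }
p TIMINGS   # => {:ems_refresh=>~0.2}
realtime_store_sketch(:ems_refresh) { sleep 0.2 }
p TIMINGS   # => {:ems_refresh=>~0.4} -- keeps growing unless something resets the hash
```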
I've discovered some additional information which might be of use. I'm not convinced that this is as random as comment #5 suggests.

The attached log file evm.log-20170405 is from a CFME 5.7 appliance with a single RHEV provider. As a result of some testing that I was performing, the evmserverd process was restarted several times during the day, which is why the log file contains the output from 7 different MetricsCollectorWorkers, with the following PIDs:

Worker PID: 12345
Worker PID: 12354
Worker PID: 3088
Worker PID: 4760
Worker PID: 4769
Worker PID: 9231
Worker PID: 9240

Grepping for any of the PIDs except 3088 in the file shows the timing errors:

grep 9240: evm.log-20170405 | grep Timings:
...
[----] I, [2017-04-05T02:49:47.058033 #9240:8a9134] INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Redhat::InfraManager::Vm name: [cloudforms06.bit63.net], id: [4], for range [2017-04-05T00:58:57Z - 2017-04-05T01:49:37Z]...Complete - Timings: {:heartbeat=>0.02439093589782715, :server_dequeue=>0.0049250125885009766, :server_monitor=>0.11138129234313965, :log_active_servers=>0.009104013442993164, :capture_state=>4.844895601272583, :rhevm_connect=>0.4119272232055664, :collect_data=>4.6500403881073, :total_time=>190.25867748260498, :process_counter_values=>0.3167150020599365, :db_find_prev_perfs=>0.43532872200012207, :process_perfs=>27.193140745162964, :process_perfs_db=>143.64353847503662, :db_find_storage_files=>0.35480237007141113, :init_attrs=>0.3011751174926758, :process_perfs_tag=>0.02492809295654297, :process_bottleneck=>4.054227590560913}
[----] I, [2017-04-05T03:02:03.452602 #9240:8a9134] INFO -- : MIQ(Storage#perf_capture) [hourly] Capture for Storage name: [vm_exports], id: [4]...Complete - Timings: {:heartbeat=>0.02439093589782715, :server_dequeue=>0.0049250125885009766, :server_monitor=>0.11138129234313965, :log_active_servers=>0.009104013442993164, :capture_state=>4.910523176193237, :rhevm_connect=>0.4119272232055664, :collect_data=>4.6500403881073, :total_time=>190.44148135185242, :process_counter_values=>0.3167150020599365, :db_find_prev_perfs=>0.44121289253234863, :process_perfs=>27.23559808731079, :process_perfs_db=>143.64353847503662, :db_find_storage_files=>0.3605070114135742, :init_attrs=>0.3113737106323242, :process_perfs_tag=>0.026996135711669922, :process_bottleneck=>4.056962728500366}
...
If you grep for PID 3088, however, the timings are all correct:

grep 3088: evm.log-20170405 | grep Timings:
[----] I, [2017-04-04T10:46:47.657733 #3088:efd134] INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::Host#perf_capture) [realtime] Capture for ManageIQ::Providers::Redhat::InfraManager::Host name: [rhevh01.bit63.net], id: [1]...Complete - Timings: {:capture_state=>1.905686616897583, :rhevm_connect=>0.4028127193450928, :collect_data=>0.18759369850158691, :total_time=>2.512446880340576}
[----] I, [2017-04-04T10:46:52.645363 #3088:efd134] INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::Host#perf_process) [realtime] Processing for ManageIQ::Providers::Redhat::InfraManager::Host name: [rhevh01.bit63.net], id: [1], for range [2017-04-04T08:55:54Z - 2017-04-04T09:46:34Z]...Complete - Timings: {:process_counter_values=>0.019831418991088867, :db_find_prev_perfs=>0.750178337097168, :process_perfs=>0.6371142864227295, :process_perfs_db=>3.5079152584075928, :total_time=>4.987343072891235}
[----] I, [2017-04-04T10:46:53.070947 #3088:efd134] INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::Host#perf_capture) [realtime] Capture for ManageIQ::Providers::Redhat::InfraManager::Host name: [rhevh02.bit63.net], id: [2]...Complete - Timings: {:capture_state=>0.12994980812072754, :rhevm_connect=>0.010659217834472656, :collect_data=>0.11516714096069336, :total_time=>0.2560760974884033}

I also notice that when the timings are incorrect, the output contains a lot of spurious timing values from other counters, such as ':heartbeat', which is a server monitor timer, not a metrics collector worker timer. When the timings are correct, only the expected timers from the metrics collector worker are printed.
Created attachment 1268888 [details]
evm.log for comment #8

This is the evm.log that comment #8 describes.
Peter, can you grep the logs of the "good timings" vs. the "bad timings" to see if a TimeoutError occurred in the bad timings?

Basically, any time we do Benchmark blocks in a specific thread and a timeout occurs, we don't clear the resulting timing hash, so it will still be around for the next Benchmark block in the same thread. That would explain why heartbeat keys would be in the perf_capture benchmarking.

I really don't know how to get around this other than removing the use of Timeout.timeout, since the ensure from our Benchmark block is not guaranteed to run if a timeout is raised while in the ensure.

https://github.com/ManageIQ/manageiq-gems-pending/blob/4f69940d10c51ffb9e918d1f0b91a03c5ef77bd7/lib/gems/pending/util/extensions/miq-benchmark.rb#L33
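For illustration, a small, self-contained demo of the mechanism being proposed here (hypothetical code, not the MIQ benchmark extension): Timeout delivers its error by asynchronously raising into the running thread, and if that raise lands while the ensure clause is still executing, the remaining cleanup, such as clearing the timings hash, is skipped:

```ruby
# Hypothetical demo: an asynchronous raise (which is how Timeout.timeout interrupts
# a block) arriving during an ensure clause skips the rest of that clause.
timings = { :previous_task => 1.23 }    # stale entry left over from an earlier block

worker = Thread.new do
  begin
    begin
      sleep 0.1                         # the "benchmarked" work completes normally
    ensure
      sleep 1                           # cleanup is still running when the raise arrives
      timings.clear                     # never reached
    end
  rescue RuntimeError
    # swallowed, as a worker's dequeue loop would swallow a timeout
  end
end

sleep 0.5
worker.raise(RuntimeError, "simulated timeout")   # asynchronous raise, like Timeout
worker.join

p timings   # => {:previous_task=>1.23} -- the stale timing survives into the next block
```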
Both good and bad timings are in the same log file, but grep -i TimeoutError of the log shows nothing. I'm happy to enable debug-level logging if it would help. It just seemed strange to me that one PID shows no bad timings, while the other 6 PIDs show no good timings.
https://github.com/ManageIQ/manageiq/pull/14868
Peter, thanks for the bug and the insistence... I think I found the answer to the problem you discovered. See: https://github.com/ManageIQ/manageiq/pull/14868

Basically, we inherit a global variable, via a class variable, for the hash of results when we fork child processes. Since we're in the server's monitor loop, we're in a Benchmark block, so we're inheriting the results from this block in each forked child worker. This is why you see log_active_servers, server_monitor, etc. in your results.

My PR above hacks around this by clearing the variable that references the shared hash in each forked child. Can you try it and let me know if it resolves the issue? You can apply the patch with curl/wget directly from the patch file:

https://patch-diff.githubusercontent.com/raw/ManageIQ/manageiq/pull/14868.patch
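A minimal sketch of the fork inheritance described above, using a hypothetical Bench class rather than the real miq-benchmark.rb (which keeps its results in @@realtime_by_tid): a block that completed before the fork has already written its timing into the class-variable hash, and because the child's main thread reports the same Thread.current.object_id as the parent's, the child's next benchmark reads the parent's entries:

```ruby
# Hypothetical sketch, not the MIQ code: a class-variable hash keyed only by
# thread id is inherited across fork, so a child starts with whatever its
# parent had accumulated at fork time.
class Bench
  @@results = Hash.new { |h, k| h[k] = Hash.new(0.0) }

  def self.time(name)
    slot  = @@results[Thread.current.object_id]   # pre-fix: keyed by thread id only
    start = Time.now
    yield
    slot[name] += Time.now - start
    slot
  end
end

Bench.time(:server_monitor) do                    # outer block still open at fork time
  Bench.time(:log_active_servers) { sleep 0.05 }  # completed: its entry is in @@results

  pid = fork do
    # The child inherits @@results and the same thread object_id, so the parent's
    # :log_active_servers entry leaks into the child's own timings.
    p Bench.time(:ems_refresh) { sleep 0.1 }
    # => {:log_active_servers=>..., :ems_refresh=>...}
  end
  Process.wait(pid)
end
```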
Thanks Joe. This is looking good. I've applied the patch to a CFME 5.7 and 5.8 beta2 system and I haven't seen the problem since on either system. Very promising!
Hi Peter, I found a better solution in our benchmark extension: https://github.com/ManageIQ/manageiq-gems-pending/pull/139

Basically, it indexes into the "inherited class variable hash" using the PID + thread ID, so you don't read another process's values.

I'll be closing https://github.com/ManageIQ/manageiq/pull/14868, but it's a much easier hack to apply and is completely valid if you need to work around the problem until PR 139 is merged and released.
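In terms of the hypothetical Bench sketch above (names illustrative; the real change is in lib/gems/pending/util/extensions/miq-benchmark.rb), the approach in PR 139 amounts to keying the shared results by both the process id and the thread id:

```ruby
# Illustrative only, not the actual miq-benchmark.rb change: key the shared
# results by [pid, thread id] so a forked child never reads its parent's entries.
class Bench
  @@results = Hash.new { |h, k| h[k] = Hash.new(0.0) }

  def self.time(name)
    slot  = @@results[[Process.pid, Thread.current.object_id]]  # post-fix key
    start = Time.now
    yield
    slot[name] += Time.now - start
    slot
  end
end
```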
New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/e8c1037a9b5e114585ca61de34985c94db013aeb

commit e8c1037a9b5e114585ca61de34985c94db013aeb
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Apr 25 15:19:52 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Apr 25 15:29:19 2017 -0400

Use pid + thread id to index into the benchmark results

https://bugzilla.redhat.com/show_bug.cgi?id=1424716

Forks inherit the @@realtime_by_tid and parent/child Thread.current.object_id are equal, so we need to index into the hash with the pid too.

The following test recreates and verifies this change but because it forks and doesn't properly bubble up exit codes, I wrote it just to confirm the problem.

```ruby
it "doesn't get parent's realtime_block in forked child" do
  p1_result, p1_timings = Benchmark.realtime_block(:parent_task1) do
    Timecop.travel(1000)

    p2_result, p2_timings = Benchmark.realtime_block(:parent_task2) do
      Timecop.travel(500)
    end
    expect(p2_timings[:parent_task2]).to be_within(0.5).of(500)
    expect(p2_timings.keys).to eq([:parent_task2])

    pid = fork do
      c_result, c_timings = Benchmark.realtime_block(:child_task) do
        Timecop.travel(250)
      end
      expect(c_timings[:child_task]).to be_within(0.5).of(250)
      expect(c_timings.keys).to eq([:child_task])
    end
  end

  expect(p1_timings[:parent_task1]).to be_within(0.5).of(1500)
  expect(p1_timings.keys.sort).to eq([:parent_task1, :parent_task2])
end
```

The tests pass because I'm not handling exit codes properly, but I see the expectation error pop up when I roll back my code change:

```
/Users/joerafaniello/.gem/ruby/2.3.4/gems/rspec-support-3.5.0/lib/rspec/support.rb:87:in `block in <module:Support>': (RSpec::Expectations::ExpectationNotMetError)
expected: [:child_task]
     got: [:parent_task2, :child_task]

(compared using ==)

Diff:
@@ -1,2 +1,2 @@
-[:child_task]
+[:parent_task2, :child_task]
```

 lib/gems/pending/util/extensions/miq-benchmark.rb | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
https://github.com/ManageIQ/manageiq/pull/14890
https://github.com/ManageIQ/manageiq/pull/14891
Peter, can you give a quick summary of how to reproduce this for QE? Which worker hit it most often? Which is the easiest way to see this behavior?

I guess the main symptom would be any worker that logs its benchmark timings of various operations, for example from one of the comments above:

Refreshing targets for EMS...Complete - Timings {:server_dequeue=>0.009632349014282227, :get_ems_data=>112.41256022453308, :get_vc_data=>888.4312617778778, :filter_vc_data=>0.4041626453399658, :get_vc_data_host_scsi=>494.66731095314026, :collect_inventory_for_targets=>1499.7596287727356, :parse_vc_data=>7.983795404434204, :parse_targeted_inventory=>8.024113893508911, :db_save_inventory=>731.6735608577728, :save_inventory=>731.6747822761536, :ems_refresh=>2239.5073816776276}

The fact that server_dequeue is a key in this hash is wrong. This server_dequeue key comes from the parent server process that forked the EMS refresh worker.

To confirm this fix works, you'd need to first recreate the logging of the timings above with server_dequeue (or server_monitor, or even log_active_servers) showing up in a completely unrelated benchmark, and then see that these "server" related benchmark timings no longer show up in the logs of the EMS refresh or cap and u workers...
The 'bug' was usually the default behaviour, so it is difficult to give exact reproduction steps. Any of the following commands will illustrate the problem:

grep ems_refresh evm.log | grep Timings
grep perf_process evm.log | grep Timings
grep perf_rollup evm.log | grep Timings

If successive output lines (for the same PID) from any of these commands show the same counter incrementing (and often becoming unrealistically large as a result), then the problem is present, e.g.:

[----] I, [2017-01-27T07:45:02.478585 #26941:66b14c] INFO -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::Vm#perf_rollup) [hourly] Rollup for ManageIQ::Providers::Vmware::InfraManager::Vm name: [VLAB502], id: [1000000000260] for time: [2017-01-27T05:00:00Z]...Complete - Timings: {... :total_time=>2097.766345500946...}
[----] I, [2017-01-27T07:45:03.325463 #26941:66b14c] INFO -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::Vm#perf_rollup) [hourly] Rollup for ManageIQ::Providers::Vmware::InfraManager::Vm name: [VSAF002P], id: [1000000000450] for time: [2017-01-27T05:00:00Z]...Complete - Timings: {... :total_time=>2098.3906078338623, ...}

If the timing values on successive lines look independent and aren't incrementing, the problem is not showing, or has been fixed.
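As a quick cross-check of those grep commands, a short hypothetical helper (again, not the attached script) that lists the :total_time value per worker PID so a monotonically increasing sequence is easy to spot; the metrics lines above include :total_time, while for refresh lines :ems_refresh plays the same role:

```ruby
# Hypothetical check, assuming the evm.log prefix format shown above:
# for an affected worker the printed values only ever grow; for a healthy
# worker they vary independently per operation.
# Usage: ruby check_total_time.rb evm.log
ARGF.each_line do |line|
  next unless line =~ /#(\d+):\w+\].*Timings.*:total_time=>([\d.]+)/
  printf("PID %-7s total_time %14.5f\n", Regexp.last_match(1), Regexp.last_match(2).to_f)
end
```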
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/05aa7b963f556cd71448f22f38d4af8505aea7e1

commit 05aa7b963f556cd71448f22f38d4af8505aea7e1
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Apr 25 22:05:41 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Apr 25 22:06:32 2017 -0400

Use pid + thread id to index into the benchmark results

https://bugzilla.redhat.com/show_bug.cgi?id=1424716

Note: This is a manual backport of: https://github.com/ManageIQ/manageiq-gems-pending/pull/139

Forks inherit the @@realtime_by_tid and parent/child Thread.current.object_id are equal, so we need to index into the hash with the pid too.

The following test recreates and verifies this change but because it forks and doesn't properly bubble up exit codes, I wrote it just to confirm the problem.

```ruby
it "doesn't get parent's realtime_block in forked child" do
  p1_result, p1_timings = Benchmark.realtime_block(:parent_task1) do
    Timecop.travel(1000)

    p2_result, p2_timings = Benchmark.realtime_block(:parent_task2) do
      Timecop.travel(500)
    end
    expect(p2_timings[:parent_task2]).to be_within(0.5).of(500)
    expect(p2_timings.keys).to eq([:parent_task2])

    pid = fork do
      c_result, c_timings = Benchmark.realtime_block(:child_task) do
        Timecop.travel(250)
      end
      expect(c_timings[:child_task]).to be_within(0.5).of(250)
      expect(c_timings.keys).to eq([:child_task])
    end
  end

  expect(p1_timings[:parent_task1]).to be_within(0.5).of(1500)
  expect(p1_timings.keys.sort).to eq([:parent_task1, :parent_task2])
end
```

The tests pass because I'm not handling exit codes properly, but I see the expectation error pop up when I roll back my code change:

```
/Users/joerafaniello/.gem/ruby/2.3.4/gems/rspec-support-3.5.0/lib/rspec/support.rb:87:in `block in <module:Support>': (RSpec::Expectations::ExpectationNotMetError)
expected: [:child_task]
     got: [:parent_task2, :child_task]

(compared using ==)

Diff:
@@ -1,2 +1,2 @@
-[:child_task]
+[:parent_task2, :child_task]
```

 gems/pending/util/extensions/miq-benchmark.rb | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
Verified on 5.9.0.1.