Description of problem:

At the completion of an EMS refresh, the worker prints some timings to evm.log showing the time taken for each stage of the refresh operation. For example:

Refreshing targets for EMS...Complete - Timings {:server_dequeue=>0.009632349014282227, :get_ems_data=>112.41256022453308, :get_vc_data=>888.4312617778778, :filter_vc_data=>0.4041626453399658, :get_vc_data_host_scsi=>494.66731095314026, :collect_inventory_for_targets=>1499.7596287727356, :parse_vc_data=>7.983795404434204, :parse_targeted_inventory=>8.024113893508911, :db_save_inventory=>731.6735608577728, :save_inventory=>731.6747822761536, :ems_refresh=>2239.5073816776276}

These timings are incorrect. Rather than showing the timing of the individual refresh operation, the timer values appear not to be zeroed prior to each refresh. As a result, each timer value grows incrementally: it becomes the sum of the duration of the most recent operation and the previous timer value. The timers are only reset when a worker is restarted.

The problem can be visualised more easily by running the attached Ruby script:

ruby ems_refresh_timings.rb -f evm.log

It prints output similar to the following, where it can be seen that each value is the sum of the current time and the previous value:

Worker PID: 8515
Refresh start time: 2017-01-30T03:17:57.168530
Refresh end time:   2017-01-30T03:18:16.228082
Provider:           Vmware::InfraManager
EMS Name:           CLOUD-FRANCE-VCENTER
Refresh type:       targeted
Refresh targets:    {"Vm"=>"frpveeam01.solar.priv"}
Refresh timings:
  server_dequeue:                 0.00289
  get_ems_data:                   0.03904
  get_vc_data:                    8.90687
  filter_vc_data:                 0.01239
  get_vc_data_host_scsi:          1.49227
  collect_inventory_for_targets: 10.45161
  parse_vc_data:                  0.04002
  parse_targeted_inventory:       0.04020
  db_save_inventory:              8.56718
  save_inventory:                 8.56720
  ems_refresh:                   19.05932

Worker PID: 8515
Refresh start time: 2017-01-30T03:20:47.844694
Refresh end time:   2017-01-30T03:21:05.605325
Provider:           Vmware::InfraManager
EMS Name:           CLOUD-FRANCE-VCENTER
Refresh type:       targeted
Refresh targets:    {"Vm"=>"frpveeam01.solar.priv"}
Refresh timings:
  server_dequeue:                 0.00289
  get_ems_data:                   0.08007
  get_vc_data:                   23.44743
  filter_vc_data:                 0.01333
  get_vc_data_host_scsi:          2.67952
  collect_inventory_for_targets: 26.22239
  parse_vc_data:                  0.05610
  parse_targeted_inventory:       0.05800
  db_save_inventory:             10.53875
  save_inventory:                10.53877
  ems_refresh:                   36.81976

Worker PID: 8515
Refresh start time: 2017-01-30T03:21:28.736595
Refresh end time:   2017-01-30T03:21:48.356252
Provider:           Vmware::InfraManager
EMS Name:           CLOUD-FRANCE-VCENTER
Refresh type:       targeted
Refresh targets:    {"Vm"=>"frpipam01.solar.priv"}
Refresh timings:
  server_dequeue:                 0.00289
  get_ems_data:                   0.15466
  get_vc_data:                   36.64317
  filter_vc_data:                 0.01537
  get_vc_data_host_scsi:          4.23346
  collect_inventory_for_targets: 41.04998
  parse_vc_data:                  0.09418
  parse_targeted_inventory:       0.10141
  db_save_inventory:             15.28555
  save_inventory:                15.28559
  ems_refresh:                   56.43840

Version-Release number of selected component (if applicable):
5.7.0.17.20161219135818_725f92f

How reproducible:
Every time

Steps to Reproduce:
1. Find an appliance that has run several EMS refreshes.
2. Run the attached script to display the benchmark timings.

Actual results:
The timings are calculated incorrectly; each timer accumulates across refreshes.

Expected results:
The timers should be reset prior to each EMS refresh.

Additional info:
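As an illustration of the accumulation, a rough sketch of how per-refresh durations could be recovered by differencing successive cumulative Timings values for the same worker PID. This is a hypothetical helper, not the attached ems_refresh_timings.rb, and the log-line regex is an assumption based on the evm.log prefix format shown elsewhere in this bug:

```ruby
#!/usr/bin/env ruby
# Hypothetical helper (not the attached script): subtract the previous cumulative
# value of each timer, per worker PID, to recover approximate per-refresh durations.
# Usage: ruby diff_refresh_timings.rb evm.log

previous = Hash.new { |h, pid| h[pid] = Hash.new(0.0) }

ARGF.each_line do |line|
  # Assumed format: "[----] I, [... #<pid>:<tid>] ... Refreshing targets for EMS...Complete - Timings {...}"
  next unless line =~ /#(\d+):\w+\].*Refreshing targets for EMS.*Complete - Timings/
  pid = Regexp.last_match(1)
  line.scan(/:(\w+)=>([\d.]+)/).each do |name, value|
    delta = value.to_f - previous[pid][name]   # duration of this refresh only
    printf("PID %-7s %-32s %12.5f\n", pid, name, delta)
    previous[pid][name] = value.to_f
  end
  puts
end
```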
Created attachment 1255286 [details]
Ruby script to demonstrate the problem
This problem does not seem to be confined to Benchmark.realtime_block timings for EMS refreshes. It also affects Benchmark.realtime_block timings for metrics collection, and possibly others.
There is a GitHub issue describing this exact problem: https://github.com/ManageIQ/manageiq/issues/9642

It appears to be caused by Timeout skipping the ensure block of the benchmark, so the timings are not reset for the next block.
From the comment in the code:

# If the hash already has a value for that key, the time is accumulated.

This is exactly what's happening, but it looks like we're getting a timeout on every call. It seems to affect all realtime counters that are printed to evm.log from any worker.
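To make the quoted comment concrete, here is a minimal sketch of the accumulate-on-repeat behaviour it describes; this is hypothetical code, not the actual miq-benchmark.rb implementation. The accumulation itself appears intentional (it lets nested blocks roll up); the question discussed below is why the hash isn't reset between operations.

```ruby
# Hypothetical sketch of the documented behaviour, not the MIQ implementation.
TIMINGS = Hash.new(0.0)

def realtime_store_sketch(key)
  start  = Time.now
  result = yield
  TIMINGS[key] += Time.now - start   # an existing key accumulates rather than resets
  result
end

realtime_store_sketch(:ems_refresh) { sleep 0.2 }
p TIMINGS   # => {:ems_refresh=>~0.2}
realtime_store_sketch(:ems_refresh) { sleep 0.2 }
p TIMINGS   # => {:ems_refresh=>~0.4} -- keeps growing unless something resets the hash
```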
I've discovered some additional information which might be of use. I'm not convinced that this is as random as comment #5 suggests.

The attached log file evm.log-20170405 is from a CFME 5.7 appliance with a single RHEV provider. As a result of some testing that I was performing, the evmserverd process was restarted several times during the day, which is why the log file contains the output from 7 different MetricsCollectorWorkers, with the following PIDs:

Worker PID: 12345
Worker PID: 12354
Worker PID: 3088
Worker PID: 4760
Worker PID: 4769
Worker PID: 9231
Worker PID: 9240

Grepping for any of the PIDs except 3088 in the file shows the timing errors:

grep 9240: evm.log-20170405 | grep Timings:
...
[----] I, [2017-04-05T02:49:47.058033 #9240:8a9134] INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::Vm#perf_process) [realtime] Processing for ManageIQ::Providers::Redhat::InfraManager::Vm name: [cloudforms06.bit63.net], id: [4], for range [2017-04-05T00:58:57Z - 2017-04-05T01:49:37Z]...Complete - Timings: {:heartbeat=>0.02439093589782715, :server_dequeue=>0.0049250125885009766, :server_monitor=>0.11138129234313965, :log_active_servers=>0.009104013442993164, :capture_state=>4.844895601272583, :rhevm_connect=>0.4119272232055664, :collect_data=>4.6500403881073, :total_time=>190.25867748260498, :process_counter_values=>0.3167150020599365, :db_find_prev_perfs=>0.43532872200012207, :process_perfs=>27.193140745162964, :process_perfs_db=>143.64353847503662, :db_find_storage_files=>0.35480237007141113, :init_attrs=>0.3011751174926758, :process_perfs_tag=>0.02492809295654297, :process_bottleneck=>4.054227590560913}
[----] I, [2017-04-05T03:02:03.452602 #9240:8a9134] INFO -- : MIQ(Storage#perf_capture) [hourly] Capture for Storage name: [vm_exports], id: [4]...Complete - Timings: {:heartbeat=>0.02439093589782715, :server_dequeue=>0.0049250125885009766, :server_monitor=>0.11138129234313965, :log_active_servers=>0.009104013442993164, :capture_state=>4.910523176193237, :rhevm_connect=>0.4119272232055664, :collect_data=>4.6500403881073, :total_time=>190.44148135185242, :process_counter_values=>0.3167150020599365, :db_find_prev_perfs=>0.44121289253234863, :process_perfs=>27.23559808731079, :process_perfs_db=>143.64353847503662, :db_find_storage_files=>0.3605070114135742, :init_attrs=>0.3113737106323242, :process_perfs_tag=>0.026996135711669922, :process_bottleneck=>4.056962728500366}
...
If you grep for PID 3088, however, the timings are all correct:

grep 3088: evm.log-20170405 | grep Timings:
[----] I, [2017-04-04T10:46:47.657733 #3088:efd134] INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::Host#perf_capture) [realtime] Capture for ManageIQ::Providers::Redhat::InfraManager::Host name: [rhevh01.bit63.net], id: [1]...Complete - Timings: {:capture_state=>1.905686616897583, :rhevm_connect=>0.4028127193450928, :collect_data=>0.18759369850158691, :total_time=>2.512446880340576}
[----] I, [2017-04-04T10:46:52.645363 #3088:efd134] INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::Host#perf_process) [realtime] Processing for ManageIQ::Providers::Redhat::InfraManager::Host name: [rhevh01.bit63.net], id: [1], for range [2017-04-04T08:55:54Z - 2017-04-04T09:46:34Z]...Complete - Timings: {:process_counter_values=>0.019831418991088867, :db_find_prev_perfs=>0.750178337097168, :process_perfs=>0.6371142864227295, :process_perfs_db=>3.5079152584075928, :total_time=>4.987343072891235}
[----] I, [2017-04-04T10:46:53.070947 #3088:efd134] INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::Host#perf_capture) [realtime] Capture for ManageIQ::Providers::Redhat::InfraManager::Host name: [rhevh02.bit63.net], id: [2]...Complete - Timings: {:capture_state=>0.12994980812072754, :rhevm_connect=>0.010659217834472656, :collect_data=>0.11516714096069336, :total_time=>0.2560760974884033}

I also notice that when the timings are incorrect, the output contains a lot of spurious timing values from other counters, such as ':heartbeat', which is a server monitor timer, not a metrics collector worker timer. When the timings are correct, only the expected timers from the metrics collector worker are printed.
Created attachment 1268888 [details]
evm.log for comment #8

This is the evm.log that comment #8 describes.
Peter, can you grep the logs of the "good timings" vs. the "bad timings" to see if a TimeoutError occurred in the bad timings?

Basically, any time we do Benchmark blocks in a specific thread and a timeout occurs, we don't clear the resulting timing hash, so it will still be around for the next Benchmark block in the same thread. That would explain why heartbeat keys would be in the perf_capture benchmarking.

I really don't know how to get around this other than removing the use of Timeout.timeout, since the ensure from our Benchmark block is not guaranteed to run if a timeout is raised while in the ensure.

https://github.com/ManageIQ/manageiq-gems-pending/blob/4f69940d10c51ffb9e918d1f0b91a03c5ef77bd7/lib/gems/pending/util/extensions/miq-benchmark.rb#L33
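For illustration, a small, self-contained demo of the mechanism being proposed here (hypothetical code, not the MIQ benchmark extension): Timeout delivers its error by asynchronously raising into the running thread, and if that raise lands while the ensure clause is still executing, the remaining cleanup, such as clearing the timings hash, is skipped:

```ruby
# Hypothetical demo: an asynchronous raise (which is how Timeout.timeout interrupts
# a block) arriving during an ensure clause skips the rest of that clause.
timings = { :previous_task => 1.23 }    # stale entry left over from an earlier block

worker = Thread.new do
  begin
    begin
      sleep 0.1                         # the "benchmarked" work completes normally
    ensure
      sleep 1                           # cleanup is still running when the raise arrives
      timings.clear                     # never reached
    end
  rescue RuntimeError
    # swallowed, as a worker's dequeue loop would swallow a timeout
  end
end

sleep 0.5
worker.raise(RuntimeError, "simulated timeout")   # asynchronous raise, like Timeout
worker.join

p timings   # => {:previous_task=>1.23} -- the stale timing survives into the next block
```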
Both good and bad timings are in the same log file, but grep -i TimeoutError of the log shows nothing. I'm happy to enable debug-level logging if it would help. It just seemed strange to me that one PID shows no bad timings, while the other 6 PIDs show no good timings.
https://github.com/ManageIQ/manageiq/pull/14868
Peter, thanks for the bug and the insistence... I think I found the answer to the problem you discovered. See: https://github.com/ManageIQ/manageiq/pull/14868

Basically, we inherit a global variable, via a class variable, for the hash of results when we fork child processes. Since we're in the server's monitor loop, we're in a Benchmark block, so we're inheriting the results from this block in each forked child worker. This is why you see log_active_servers, server_monitor, etc. in your results.

My PR above hacks around this by clearing the variable that references the shared hash in each forked child. Can you try it and let me know if it resolves the issue? You can apply the patch with curl/wget directly from the patch file:

https://patch-diff.githubusercontent.com/raw/ManageIQ/manageiq/pull/14868.patch
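A minimal sketch of the fork inheritance described above, using a hypothetical Bench class rather than the real miq-benchmark.rb (which keeps its results in @@realtime_by_tid): a block that completed before the fork has already written its timing into the class-variable hash, and because the child's main thread reports the same Thread.current.object_id as the parent's, the child's next benchmark reads the parent's entries:

```ruby
# Hypothetical sketch, not the MIQ code: a class-variable hash keyed only by
# thread id is inherited across fork, so a child starts with whatever its
# parent had accumulated at fork time.
class Bench
  @@results = Hash.new { |h, k| h[k] = Hash.new(0.0) }

  def self.time(name)
    slot  = @@results[Thread.current.object_id]   # pre-fix: keyed by thread id only
    start = Time.now
    yield
    slot[name] += Time.now - start
    slot
  end
end

Bench.time(:server_monitor) do                    # outer block still open at fork time
  Bench.time(:log_active_servers) { sleep 0.05 }  # completed: its entry is in @@results

  pid = fork do
    # The child inherits @@results and the same thread object_id, so the parent's
    # :log_active_servers entry leaks into the child's own timings.
    p Bench.time(:ems_refresh) { sleep 0.1 }
    # => {:log_active_servers=>..., :ems_refresh=>...}
  end
  Process.wait(pid)
end
```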
Thanks Joe. This is looking good. I've applied the patch to a CFME 5.7 and 5.8 beta2 system and I haven't seen the problem since on either system. Very promising!
Hi Peter, I found a better solution in our benchmark extension: https://github.com/ManageIQ/manageiq-gems-pending/pull/139

Basically, it indexes into the "inherited class variable hash" using the PID + thread ID, so you don't read another process's values.

I'll be closing https://github.com/ManageIQ/manageiq/pull/14868, but it's a much easier hack to apply and is completely valid if you need to work around the problem until PR 139 is merged and released.
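In terms of the hypothetical Bench sketch above (names illustrative; the real change is in lib/gems/pending/util/extensions/miq-benchmark.rb), the approach in PR 139 amounts to keying the shared results by both the process id and the thread id:

```ruby
# Illustrative only, not the actual miq-benchmark.rb change: key the shared
# results by [pid, thread id] so a forked child never reads its parent's entries.
class Bench
  @@results = Hash.new { |h, k| h[k] = Hash.new(0.0) }

  def self.time(name)
    slot  = @@results[[Process.pid, Thread.current.object_id]]  # post-fix key
    start = Time.now
    yield
    slot[name] += Time.now - start
    slot
  end
end
```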
New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/e8c1037a9b5e114585ca61de34985c94db013aeb

commit e8c1037a9b5e114585ca61de34985c94db013aeb
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Apr 25 15:19:52 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Apr 25 15:29:19 2017 -0400

Use pid + thread id to index into the benchmark results

https://bugzilla.redhat.com/show_bug.cgi?id=1424716

Forks inherit the @@realtime_by_tid and parent/child Thread.current.object_id are equal, so we need to index into the hash with the pid too.

The following test recreates and verifies this change but because it forks and doesn't properly bubble up exit codes, I wrote it just to confirm the problem.

```ruby
it "doesn't get parent's realtime_block in forked child" do
  p1_result, p1_timings = Benchmark.realtime_block(:parent_task1) do
    Timecop.travel(1000)

    p2_result, p2_timings = Benchmark.realtime_block(:parent_task2) do
      Timecop.travel(500)
    end
    expect(p2_timings[:parent_task2]).to be_within(0.5).of(500)
    expect(p2_timings.keys).to eq([:parent_task2])

    pid = fork do
      c_result, c_timings = Benchmark.realtime_block(:child_task) do
        Timecop.travel(250)
      end
      expect(c_timings[:child_task]).to be_within(0.5).of(250)
      expect(c_timings.keys).to eq([:child_task])
    end
  end

  expect(p1_timings[:parent_task1]).to be_within(0.5).of(1500)
  expect(p1_timings.keys.sort).to eq([:parent_task1, :parent_task2])
end
```

The tests pass because I'm not handling exit codes properly, but I see the expectation error pop up when I roll back my code change:

```
/Users/joerafaniello/.gem/ruby/2.3.4/gems/rspec-support-3.5.0/lib/rspec/support.rb:87:in `block in <module:Support>': (RSpec::Expectations::ExpectationNotMetError)
expected: [:child_task]
     got: [:parent_task2, :child_task]

(compared using ==)

Diff:
@@ -1,2 +1,2 @@
-[:child_task]
+[:parent_task2, :child_task]
```

 lib/gems/pending/util/extensions/miq-benchmark.rb | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
https://github.com/ManageIQ/manageiq/pull/14890
https://github.com/ManageIQ/manageiq/pull/14891
Peter, can you give a quick summary of how to reproduce this for QE? Which worker hit it most often? Which is the easiest way to see this behavior?

I guess the main symptom would be any worker that logs its benchmark timings of various operations, for example from one of the comments above:

Refreshing targets for EMS...Complete - Timings {:server_dequeue=>0.009632349014282227, :get_ems_data=>112.41256022453308, :get_vc_data=>888.4312617778778, :filter_vc_data=>0.4041626453399658, :get_vc_data_host_scsi=>494.66731095314026, :collect_inventory_for_targets=>1499.7596287727356, :parse_vc_data=>7.983795404434204, :parse_targeted_inventory=>8.024113893508911, :db_save_inventory=>731.6735608577728, :save_inventory=>731.6747822761536, :ems_refresh=>2239.5073816776276}

The fact that server_dequeue is a key in this hash is wrong. This server_dequeue key comes from the parent server process that forked the EMS refresh worker.

To confirm this fix works, you'd need to first recreate the logging of the timings above with server_dequeue (or server_monitor, or even log_active_servers) showing up in a completely unrelated benchmark, and then see that these "server" related benchmark timings no longer show up in the logs of the EMS refresh or cap and u workers...
The 'bug' was usually the default behaviour, so it is difficult to give exact reproduction steps. Any of the following commands will illustrate the problem:

grep ems_refresh evm.log | grep Timings
grep perf_process evm.log | grep Timings
grep perf_rollup evm.log | grep Timings

If successive output lines (for the same PID) from any of these commands show the same counter incrementing (and often becoming unrealistically large as a result), then the problem is present, e.g.:

[----] I, [2017-01-27T07:45:02.478585 #26941:66b14c] INFO -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::Vm#perf_rollup) [hourly] Rollup for ManageIQ::Providers::Vmware::InfraManager::Vm name: [VLAB502], id: [1000000000260] for time: [2017-01-27T05:00:00Z]...Complete - Timings: {... :total_time=>2097.766345500946...}
[----] I, [2017-01-27T07:45:03.325463 #26941:66b14c] INFO -- : MIQ(ManageIQ::Providers::Vmware::InfraManager::Vm#perf_rollup) [hourly] Rollup for ManageIQ::Providers::Vmware::InfraManager::Vm name: [VSAF002P], id: [1000000000450] for time: [2017-01-27T05:00:00Z]...Complete - Timings: {... :total_time=>2098.3906078338623, ...}

If the timing values on successive lines look independent and aren't incrementing, the problem is not showing, or has been fixed.
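As a quick cross-check of those grep commands, a short hypothetical helper (again, not the attached script) that lists the :total_time value per worker PID so a monotonically increasing sequence is easy to spot; the metrics lines above include :total_time, while for refresh lines :ems_refresh plays the same role:

```ruby
# Hypothetical check, assuming the evm.log prefix format shown above:
# for an affected worker the printed values only ever grow; for a healthy
# worker they vary independently per operation.
# Usage: ruby check_total_time.rb evm.log
ARGF.each_line do |line|
  next unless line =~ /#(\d+):\w+\].*Timings.*:total_time=>([\d.]+)/
  printf("PID %-7s total_time %14.5f\n", Regexp.last_match(1), Regexp.last_match(2).to_f)
end
```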
New commit detected on ManageIQ/manageiq/euwe:
https://github.com/ManageIQ/manageiq/commit/05aa7b963f556cd71448f22f38d4af8505aea7e1

commit 05aa7b963f556cd71448f22f38d4af8505aea7e1
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Apr 25 22:05:41 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Apr 25 22:06:32 2017 -0400

Use pid + thread id to index into the benchmark results

https://bugzilla.redhat.com/show_bug.cgi?id=1424716

Note: This is a manual backport of: https://github.com/ManageIQ/manageiq-gems-pending/pull/139

Forks inherit the @@realtime_by_tid and parent/child Thread.current.object_id are equal, so we need to index into the hash with the pid too.

The following test recreates and verifies this change but because it forks and doesn't properly bubble up exit codes, I wrote it just to confirm the problem.

```ruby
it "doesn't get parent's realtime_block in forked child" do
  p1_result, p1_timings = Benchmark.realtime_block(:parent_task1) do
    Timecop.travel(1000)

    p2_result, p2_timings = Benchmark.realtime_block(:parent_task2) do
      Timecop.travel(500)
    end
    expect(p2_timings[:parent_task2]).to be_within(0.5).of(500)
    expect(p2_timings.keys).to eq([:parent_task2])

    pid = fork do
      c_result, c_timings = Benchmark.realtime_block(:child_task) do
        Timecop.travel(250)
      end
      expect(c_timings[:child_task]).to be_within(0.5).of(250)
      expect(c_timings.keys).to eq([:child_task])
    end
  end

  expect(p1_timings[:parent_task1]).to be_within(0.5).of(1500)
  expect(p1_timings.keys.sort).to eq([:parent_task1, :parent_task2])
end
```

The tests pass because I'm not handling exit codes properly, but I see the expectation error pop up when I roll back my code change:

```
/Users/joerafaniello/.gem/ruby/2.3.4/gems/rspec-support-3.5.0/lib/rspec/support.rb:87:in `block in <module:Support>': (RSpec::Expectations::ExpectationNotMetError)
expected: [:child_task]
     got: [:parent_task2, :child_task]

(compared using ==)

Diff:
@@ -1,2 +1,2 @@
-[:child_task]
+[:parent_task2, :child_task]
```

 gems/pending/util/extensions/miq-benchmark.rb | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
Verified on 5.9.0.1.