Bug 1251154

Summary: Provider refresh duration runs so long that multiple ems_worker processes for the same provider become active concurrently
Product: Red Hat CloudForms Management Engine
Reporter: Thomas Hennessy <thenness>
Component: Performance
Assignee: Ladislav Smola <lsmola>
Status: ON_QA
QA Contact: Pradeep Kumar Surisetty <psuriset>
Severity: high
Priority: medium
Version: 5.4.0
CC: clasohm, cpelland, dajohnso, gblomqui, gtanzill, hroy, igortiunov, jdeubel, jhardy, jocarter, lsmola, mfeifer, mhild, simaishi
Target Milestone: GA
Keywords: TestOnly, ZStream
Target Release: 5.8.0
Hardware: x86_64
OS: Linux
Whiteboard: ems_refresh:perf
Fixed In Version: 5.8.0.0
Doc Type: Bug Fix
Clone Of:
: 1290217 1412729 (view as bug list)
Type: Bug
Bug Blocks: 1412729, 1290217    
Attachments (all flags: none):
full evm.log file as gz
spreadsheet identifying AWS ems_refresh workers and highlighting two processes active concurrently
grep of first of two ems_refresh worker processes running concurrently
grep of second of two ems_refresh workers for same provider running concurrently

Description Thomas Hennessy 2015-08-06 14:58:14 UTC
Created attachment 1060007 [details]
full evm.log file as gz

Description of problem: Two ems_refresh workers for the same Amazon EC2 provider are active concurrently and both are performing full refreshes.


Version-Release number of selected component (if applicable):
 Version: 5.4.0.5
 Build:   20150605150206_7daa1a8 


How reproducible: This is the first time I've seen this, so I cannot gauge how reproducible it is, although it can probably be engineered to occur more frequently for debugging purposes.


Steps to Reproduce:
1. Create/find/simulate a provider where the full refresh time is a large fraction (> 50%) of the restart interval.
2. Wait until the time remaining before the next scheduled restart of the ems_refresh worker is shorter than the full refresh time for the associated provider, then force a full ems_refresh.
3. Monitor the workers to see that an additional ems_refresh worker for that provider becomes active and initiates a second full refresh.

Actual results:
multiple ems_refresh workers for the same provider become active concurrently


Expected results:
one and only one ems_refresh worker per provider should ever be active in the region/zone.


Additional info: In the customer case, the Amazon EC2 provider has at least 8k VMs and the full ems_refresh time exceeds 50 minutes. The default restart_interval of 2.hours is active in this environment, and while the first ems_refresh worker is signaled to exit, it is in the middle of a full refresh and so is blind to the signal. A second/replacement ems_refresh worker is then started and it also begins a full ems_refresh, resulting in significant pressure on the appliance's real memory and heavy swap usage. Other issues (UI workers failing) were reported by the customer, but they are not the problem being presented here.
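
For illustration only (the class and method names below are invented, not the actual MiqWorker code): a worker whose main loop only checks for an exit request between refreshes cannot react to a restart signal that arrives mid-refresh, which is exactly the window in which a replacement worker gets started.

    # Illustrative sketch only -- not the real CFME/MiqWorker implementation.
    class ToyRefreshWorker
      RESTART_INTERVAL = 2 * 60 * 60 # seconds; mirrors the 2.hours default above

      def initialize
        @exit_requested = false
        @started_at     = Time.now
        Signal.trap("TERM") { @exit_requested = true } # restart/exit request
      end

      def run
        loop do
          break if @exit_requested          # only honored *between* refreshes
          do_full_refresh                   # can run 50+ minutes for ~8k VMs,
                                            # so a TERM received here is ignored
          break if Time.now - @started_at > RESTART_INTERVAL
        end
      end

      def do_full_refresh
        sleep 60 # stands in for a long provider inventory refresh
      end
    end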

In this case, it appears that there is not yet a completed full ems_refresh for the provider so there is no information available as to how long the full refresh actually takes. 

Full appliance logs are provided, including a small spreadsheet filtered to focus on the ems_refresh workers and their start, stop, and last recognized log line times, to show that two are active concurrently.

Comment 1 Thomas Hennessy 2015-08-06 15:03:41 UTC
Created attachment 1060009 [details]
spreadsheet identifying AWS ems_refresh workers and highlighting two processes active concurrently

Comment 2 Thomas Hennessy 2015-08-06 15:09:13 UTC
Created attachment 1060010 [details]
grep of first of two ems_refresh worker processes running concurrently

Comment 3 Thomas Hennessy 2015-08-06 15:10:01 UTC
Created attachment 1060011 [details]
grep of second of two ems_refresh workers for same provider running concurrently

Comment 5 Keenan Brock 2016-01-12 03:45:26 UTC
This doesn't seem like a scaling issue, but rather an RFE for the scheduler to be smart enough to know that a job is already running.

Comment 7 Keenan Brock 2016-11-11 21:45:39 UTC
Ok, so this is what I read:

1. Make refreshes respond to an exit signal
2. Make refreshes quicker
3. Make refreshes take up less memory (response to comment 6 and we don't have the resources to do what the customer wants)


It sounds like increasing resources is your workaround for now.
Tom, did I understand you correctly that you want me to focus on issue 1, properly responding to exit signals?
Have you seen this problem with a more recent build than 5.4?

Thanks,
Keenan

Comment 8 Thomas Hennessy 2016-11-14 14:05:33 UTC
Keenan,
Discounting the fact that this particular case is 15 months old, it seems to me that the key issue is this: once CFME has determined that a worker process must end, it should signal the worker to terminate itself, allow some period of time for the worker to recognize that signal and respond, and, if there is no response after that period, kill the process and mark the miq_worker instance in such a way that the system recognizes a new worker is needed.

This would prevent the limited resources of an appliance from being overloaded by multiple redundant workers exceeding the specified explicit or implicit worker target count, which I assume to be "one and only one" for provider refresh workers. It would also prevent other cases that are now becoming recognized in the latest code base (5.6.x).

Tom Hennessy
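
A minimal sketch of the terminate-with-timeout flow described above (helper and field names here are hypothetical, not the actual CFME worker-monitor API): signal the worker, allow a grace period, then force-kill it and mark its miq_worker row so a replacement is started only after the old process is gone.

    # Hypothetical sketch of "signal, wait, then kill" -- not the real CFME API.
    GRACE_PERIOD = 10 * 60 # seconds allowed for a graceful exit

    def stop_worker(worker)
      Process.kill("TERM", worker.pid)      # ask the worker to exit
      deadline = Time.now + GRACE_PERIOD

      until process_gone?(worker.pid)
        if Time.now > deadline
          Process.kill("KILL", worker.pid)  # worker never acted on TERM
          worker.update(status: "killed")   # mark the miq_worker row so the
          break                             # monitor starts a replacement
        end
        sleep 5
      end
    end

    def process_gone?(pid)
      Process.kill(0, pid) # signal 0: existence check only
      false
    rescue Errno::ESRCH
      true
    end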

Comment 10 Ladislav Smola 2016-11-14 22:10:41 UTC
@GreGT could you guys make sure that item 1 from comment #7 is working?

I am fixing 2. with comment #8: a refresh of 30k templates and 10k VMs should be down from 3+ hours to around 15 minutes, plus the time to fetch the data from the API.

Point 3 from comment #7 is a plan for the future; it's not easily solvable or backportable.

Comment 11 Ladislav Smola 2016-11-14 22:12:15 UTC
I meant 2. is fixed with comment #9

Comment 12 Keenan Brock 2016-11-15 19:46:54 UTC
Also, https://github.com/ManageIQ/manageiq/pull/12360 speeds up the refresh

Comment 13 CFME Bot 2016-11-22 18:01:37 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/a6f07f2d9baf29652179e18cbd9aed3ceb184e4d

commit a6f07f2d9baf29652179e18cbd9aed3ceb184e4d
Author:     Ladislav Smola <lsmola>
AuthorDate: Mon Nov 14 21:56:18 2016 +0100
Commit:     Ladislav Smola <lsmola>
CommitDate: Wed Nov 16 19:23:13 2016 +0100

    Remove O(n^2) when updating VMs
    
    The code that does an O(n) search on an array is causing O(n^2) behavior
    in the VmOrTemplate saving code. A refresh of 30k templates and 10k VMs
    would then take 3+ hours, at about 250ms per VmOrTemplate saved.
    
    Using a simple hash instead of an array for all creates and updates,
    we save one VmOrTemplate in about 9ms (if packed in one transaction).
    
    So the whole saving code is down to 15 minutes, from 3+ hours.
    
    Fixes BZ:
    https://bugzilla.redhat.com/show_bug.cgi?id=1251154

 app/models/ems_refresh/save_inventory.rb | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
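
For readers who want the gist of the change without opening the diff, the pattern is roughly the following (a simplified sketch with made-up data, not the actual save_inventory.rb code): the per-record Array#find is replaced with a hash index built once and keyed on the record identifier, turning the O(n^2) scan into O(n).

    # Simplified sketch of the array-vs-hash lookup pattern described above
    # (illustrative data; not the actual save_inventory.rb code).
    records = (1..5_000).map { |i| { ems_ref: "vm-#{i}", name: "vm#{i}" } }
    updates = (1..5_000).map { |i| { ems_ref: "vm-#{i}", name: "renamed#{i}" } }

    # Before: an O(n) Array#find per update => O(n^2) overall.
    # The gap grows quickly; at 30k-40k records this dominates the refresh time.
    updates.each do |u|
      rec = records.find { |r| r[:ems_ref] == u[:ems_ref] }
      rec[:name] = u[:name] if rec
    end

    # After: build a hash index once, then O(1) lookup per update => O(n) overall.
    index = records.map { |r| [r[:ems_ref], r] }.to_h
    updates.each do |u|
      rec = index[u[:ems_ref]]
      rec[:name] = u[:name] if rec
    end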