Bug 1251154 - Provider refresh duration runs so long that multiple ems_worker processes for the same provider become active concurrently
Status: ON_QA
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.4.0
Hardware: x86_64 Linux
Priority: medium
Severity: high
Target Milestone: GA
Target Release: 5.8.0
Assigned To: Ladislav Smola
QA Contact: Pradeep Kumar Surisetty
Whiteboard: ems_refresh:perf
Keywords: TestOnly, ZStream
Depends On:
Blocks: 1412729 1290217
Reported: 2015-08-06 10:58 EDT by Thomas Hennessy
Modified: 2018-02-06 17:25 EST
CC List: 18 users

See Also:
Fixed In Version: 5.8.0.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1290217 1412729
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
full evm.log file as gz (8.88 MB, application/x-gzip)
2015-08-06 10:58 EDT, Thomas Hennessy
spreadsheet identifying AWS ems_refresh workers and highlighting two processes active concurrently (19.49 KB, application/zip)
2015-08-06 11:03 EDT, Thomas Hennessy
grep of first of two ems_refresh worker processes running concurrently (411.63 KB, application/zip)
2015-08-06 11:09 EDT, Thomas Hennessy
grep of second of two ems_refresh workers for same provider running concurrently (183.23 KB, application/zip)
2015-08-06 11:10 EDT, Thomas Hennessy

Description Thomas Hennessy 2015-08-06 10:58:14 EDT
Created attachment 1060007 [details]
full evm.log file as gz

Description of problem: Two ems_refresh workers for the same Amazon EC2 provider are active concurrently and both are performing full refreshes.


Version-Release number of selected component (if applicable):
 Version: 5.4.0.5
 Build:   20150605150206_7daa1a8 


How reproducible: This is the first time I've seen this, so I cannot gauge how reproducible it is, although it can probably be engineered to occur more frequently for debugging purposes.


Steps to Reproduce:
1. Create/find/simulate a provider where the full refresh time is a large fraction (> 50%) of the restart interval.
2. Wait until the time remaining before the ems_refresh worker's next scheduled restart is shorter than the full refresh time for the associated provider, then force a full ems_refresh.
3. Monitor the worker types to see that an additional ems_refresh worker for that provider becomes active and initiates a second full refresh.

Actual results:
Multiple ems_refresh workers for the same provider become active concurrently.


Expected results:
One and only one ems_refresh worker per provider should ever be active in the region/zone.


Additional info: In the customer case, the Amazon EC2 provider seems to have at least 8k VMs, and the full ems_refresh time exceeds 50 minutes. The default restart_interval of 2.hours is active in this environment, and while the first ems_refresh worker is signaled to exit, it is in the middle of a full refresh and so is blind to the signal. A second/replacement ems_refresh worker is started, and it also begins a full ems_refresh, putting significant pressure on the appliance's real memory and causing a significant invasion of swap. Other issues (UI workers failing) were reported by the customer, but they are not the problem being presented here.

In this case, it appears that no full ems_refresh for the provider has yet completed, so there is no information available on how long the full refresh actually takes.

Full appliance logs are provided, including a small spreadsheet filtered to focus on the ems_refresh workers and their start, stop, and last recognized log line times, showing that two are active concurrently.
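
To make the mechanism above concrete, here is a minimal, hypothetical sketch (not the actual MiqWorker code) of a refresh loop that only notices an exit request between refreshes; while a 50+ minute full refresh is running, the restart signal goes unanswered and the server starts a replacement worker:

# Hypothetical sketch of the race described above -- not actual ManageIQ code.
# A worker that only checks for an exit request between refreshes will ignore
# the restart signal for the entire duration of a long full refresh.
class RefreshWorkerSketch
  def initialize
    @exit_requested = false
    Signal.trap("TERM") { @exit_requested = true } # signal is recorded, not acted on
  end

  def run
    loop do
      break if @exit_requested # the flag is only honored here...
      full_refresh             # ...so a 50+ minute refresh blocks the exit
    end
  end

  def full_refresh
    # long-running inventory collection and saving for the whole provider
  end
end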
Comment 1 Thomas Hennessy 2015-08-06 11:03:41 EDT
Created attachment 1060009 [details]
spreadsheet identifying AWS ems_refresh workers and highlighting two processes active concurrently
Comment 2 Thomas Hennessy 2015-08-06 11:09:13 EDT
Created attachment 1060010 [details]
grep of first of two ems_refresh worker processes running concurrently
Comment 3 Thomas Hennessy 2015-08-06 11:10:01 EDT
Created attachment 1060011 [details]
grep of second of two ems_refresh workers for same provider running concurrently
Comment 5 Keenan Brock 2016-01-11 22:45:26 EST
This doesn't seem like a scaling issue, but rather an RFE for a scheduler that is smart enough to know that a job is already running.
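
A minimal sketch of what such a guard could look like, using hypothetical helper names (refresh_in_progress?, enqueue_refresh); this is not the actual ManageIQ scheduler code:

# Hypothetical sketch of the scheduler guard suggested above -- not actual ManageIQ code.
# Before queueing another full refresh, check whether one is already queued or running
# for the same provider and skip the duplicate instead of stacking a second refresh.
def queue_full_refresh(ems)
  if refresh_in_progress?(ems) # assumed helper: inspects queue/worker state for this EMS
    logger.info("Skipping refresh for EMS #{ems.id}: a full refresh is already active")
    return
  end
  enqueue_refresh(ems)         # assumed helper: dispatches the full refresh job
end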
Comment 7 Keenan Brock 2016-11-11 16:45:39 EST
Ok, so this is what I read:

1. Make refreshes respond to an exit signal
2. Make refreshes quicker
3. Make refreshes take up less memory (in response to comment 6; we don't have the resources to do what the customer wants)


It sounds like increasing resources is your workaround for now.
Tom, did I understand you correctly that you want me to focus on issue 1, properly responding to exit signals?
Have you seen this problem with a more recent build than 5.4?

Thanks,
Keenan
Comment 8 Thomas Hennessy 2016-11-14 09:05:33 EST
Keenan,
Discounting the fact that this particular case is 15 months old, it seems to me that the key issue is this: once CFME has determined that a worker process must end, CFME should signal the worker to terminate itself, allow some period of time for the worker to recognize that signal and respond, and, if there is no response after that period, kill the process and mark the miq_worker instance in such a way that the system recognizes that a new worker is needed.

This would prevent the limited resources of an appliance from being overloaded by multiple redundant workers exceeding the specified explicit or implicit worker target count, which I assume to be one and only one for provider refresh workers, and would also prevent other cases of this kind, which are now becoming recognized in the latest code base (5.6.x).

Tom Hennessy
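
A minimal sketch of the terminate-wait-kill flow described in comment 8, using hypothetical helper names (worker_alive?, mark_worker_dead) and an assumed grace period; this is not the actual MiqServer code:

# Hypothetical sketch of the supervision flow described in comment 8 --
# not the actual MiqServer/MiqWorker code.
GRACE_PERIOD = 600 # seconds; assumed value for how long to wait for a clean exit

def retire_worker(worker)
  Process.kill("TERM", worker.pid) # ask the worker to exit cleanly
  deadline = Time.now + GRACE_PERIOD
  sleep(5) while worker_alive?(worker.pid) && Time.now < deadline

  return unless worker_alive?(worker.pid) # it exited on its own within the grace period

  Process.kill("KILL", worker.pid) # no response: force-kill the unresponsive worker
  mark_worker_dead(worker)         # assumed helper: flag the miq_worker row so the
                                   # system knows a replacement worker is needed
end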
Comment 10 Ladislav Smola 2016-11-14 17:10:41 EST
@GreGT could you guys make sure that item 1 from comment #7 is working?

I am fixing item 2 with comment #8; a refresh of 30k templates and 10k VMs should drop from 3+ hours to around 15 minutes, plus the time to fetch the data from the API.

Point 3 from comment #7 is a plan for the future; it is not easily solvable or backportable.
Comment 11 Ladislav Smola 2016-11-14 17:12:15 EST
I meant that item 2 is fixed with comment #9.
Comment 12 Keenan Brock 2016-11-15 14:46:54 EST
Also, https://github.com/ManageIQ/manageiq/pull/12360 speeds up the refresh
Comment 13 CFME Bot 2016-11-22 13:01:37 EST
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/a6f07f2d9baf29652179e18cbd9aed3ceb184e4d

commit a6f07f2d9baf29652179e18cbd9aed3ceb184e4d
Author:     Ladislav Smola <lsmola@redhat.com>
AuthorDate: Mon Nov 14 21:56:18 2016 +0100
Commit:     Ladislav Smola <lsmola@redhat.com>
CommitDate: Wed Nov 16 19:23:13 2016 +0100

    Remove O(n^2) when updating VMs
    
    The code that does an O(n) search on an array causes O(n^2) behavior in the
    VmOrTemplate saving code. A refresh of 30k templates and 10k VMs would then
    take 3+ hours, at about 250ms per VmOrTemplate saved.
    
    Using a simple hash instead of an array for all creates and updates, we save
    one VmOrTemplate in about 9ms (if packed in one transaction).
    
    So the whole saving code is down to 15 minutes, from 3+ hours.
    
    Fixes BZ:
    https://bugzilla.redhat.com/show_bug.cgi?id=1251154

 app/models/ems_refresh/save_inventory.rb | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
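
For illustration only, this is the shape of the change the commit describes (not the actual save_inventory.rb code; create_record is a hypothetical helper): replacing a per-record linear scan of an array with a hash keyed by the record identifier turns the O(n^2) save loop into O(n).

# Illustrative sketch of the optimization described in the commit above --
# not the actual save_inventory.rb code.

# Before: an O(n) detect inside an O(n) loop => O(n^2) overall
hashes.each do |hash|
  record = records.detect { |r| r.ems_ref == hash[:ems_ref] }
  record ? record.update!(hash) : create_record(hash) # create_record is a hypothetical helper
end

# After: build a hash index once, then do O(1) lookups per record
records_by_ref = records.index_by(&:ems_ref) # ActiveSupport's index_by
hashes.each do |hash|
  record = records_by_ref[hash[:ems_ref]]
  record ? record.update!(hash) : create_record(hash)
end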
