Red Hat Bugzilla – Bug 1251154
Provider refresh duration runs so long that multiple ems_worker processes for the same provider become active concurrently
Last modified: 2018-02-06 17:25:33 EST
Created attachment 1060007 [details]
full evm.log file as gz
Description of problem: Two ems_refresh workers for the same Amazon EC2 provider are active concurrently and both are performing full refreshes.
Version-Release number of selected component (if applicable):
How reproducible: this is the first time I've seen this, so I cannot gauge how reproduceable this is althought it can probably be engineered to occur more frequently for debugging purposes.
Steps to Reproduce:
1. create/find/simulate a provider where the full refresh time is a larger fraction (> 50% ) of the restart interval.
2. wait until the next restart for the ems_refresh worker is shorter than the 'full refresh time' for the associated provider and then force a full ems_refresh
3. monitor the worker types to see that an additional ems_refresh worker for that provider becomes active and initiates a second full refresh.
multiple ems_refresh workers for the same provider become active concurrently
one and only one ems_refresh worker per provider should ever be active in the region/zone.
Additional info: In the customer case, the Amazon EC2 provider seems to have at least 8k VMs and the full_ems_refresh time is exceeding 50 minutes. the default restart_interval of 2.hours is active in this environment and while the first ems_refresh_worker is signaled to exit, it is in the middle of a full refresh and so is blind to the signal. A second/replacement ems_refresh worker is initiated and it also starts a full ems_refresh resulting in significant pressure to the real memory on the appliance and a significant invasion of swap. Other issues (UI workers failing) was reported by the customer, but that is not the problem being presented here.
In this case, it appears that there is not yet a completed full ems_refresh for the provider so there is no information available as to how long the full refresh actually takes.
full appliance logs are provided including a small spreadsheet filtered to focus on the ems_refresh workers, and their start, stop and last recognized log line time to show that two are active concurrently.
Created attachment 1060009 [details]
spreadsheet identifynig AWS ems_refresh workers and highlighting two processes active concurrently
Created attachment 1060010 [details]
grep of first of two ems_refresh worker processes running concurrently
Created attachment 1060011 [details]
grep of second of two ems_refresh workers for same provider running concurrently
This doesn't seem like a scaling issue, but rather an RFE to the scheduler that is smart enough to know that a job is already running.
Ok, so this is what I read:
1. Make refreshes respond to an exit signal
2. Make refreshes quicker
3. Make refreshes take up less memory (response to comment 6 and we don't have the resources to do what the customer wants)
It sounds like increasing resources is your work around for now.
Tom, Did I understand you correctly that you want me to focus on issue 1, properly respond to exit signals?
Have you seen this problem with a more recent build than 5.4?
Discounting the fact that this particular case is 15 months old, it seems to me that the key issue is that once CFME has determined that a worker process must end, it is appropriate that CFME signal the worker to terminate itself, allow some period of time for the worker to recognized that signal and respond, and if there is no response after that period, the process should be killed and the miq_worker instances marked in such a way that the system recognized that a new worker is needed.
This would prevent overloading the limited resources of an appliance by allowing multiple redundant workers to exceed the specified explicit or implicit worker target count, which I assume to be "1 and only one' for provider refresh workers,
but would also prevent other cases which are now in the latest code base (5.6.x) becomming recognized.
@GreGT could you guys make sure the 1. from comment #7 is working?
I am fixing 2. with comment #8, 30k templates and 10k Vms should be down from 3+hours to around 15 minutes, + time to fetch the data from the API.
For point 3 from #7, that is a plan for the future, it's not easily solvable and backportable.
I meant 2. is fixed with comment #9
Also, https://github.com/ManageIQ/manageiq/pull/12360 speeds up the refresh
New commit detected on ManageIQ/manageiq/master:
Author: Ladislav Smola <email@example.com>
AuthorDate: Mon Nov 14 21:56:18 2016 +0100
Commit: Ladislav Smola <firstname.lastname@example.org>
CommitDate: Wed Nov 16 19:23:13 2016 +0100
Remove O(n^2) when updating VMs
The the code that does O(n) search on array, is causing O(n^2) in the
VmOrTemplate saving code. Then the refresh of 30k templates and
10k Vms would take 3+hours and about 250ms per saving VmOrTemplate
Using a simple hash instead of an array for all creates and updates,
we save one VmOrTemplate for about 9ms (if packed in one transaction)
So the whole saving code is down to 15 minutes, from 3+hours.
app/models/ems_refresh/save_inventory.rb | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)