Description of problem:
Following a recent upgrade from 4.0 to 4.1, I am now seeing a large number of evm_worker_memory_exceeded log lines, workers being stopped, and heavy swap usage. I am told that a similar issue arose when this appliance was upgraded from 3.2 to 4.0, but that issue (1322485) is closed and its fix has been merged into 4.1.

Version-Release number of selected component (if applicable):
cfme-5.6.1.2

How reproducible:
Very reproducible in this environment; not reproducible in others.

Actual results:
[----] I, [2016-09-20T13:01:09.686855 #27855:10df998]  INFO -- : Followed Relationship [miqaedb:/System/Event/MiqEvent/POLICY/evm_worker_memory_exceeded#create]

Expected results:
Normal operation.

Additional info:
(logs pending)
Looks like something for the Performance team. Changing component.
Do you have a copy of the log files, or anything else we can work with? This is a little tricky to debug from the provided information. Thanks.
Nick, this sounds related: https://bugzilla.redhat.com/show_bug.cgi?id=1391687 (memory thresholds for specific workers need to be increased since they weren't bumped when we moved to the generational GC of ruby 2.1+). Note, the swap behavior still sounds like it could be a different problem: if we keep recycling workers, it's possible we start new ones before the old ones get killed and we end up swapping. Either way, we should try to get logs and see if that's what is happening.
Joe makes a good point, and the first step should be to see whether the following memory threshold changes help (see the settings sketch after this list):
- Change the default worker threshold from 200 to 400 MB. ems_refresh_core_worker inherits this 400 MB default.
- Change the default queue worker threshold from 400 to 500 MB. generic_worker inherits this 500 MB value.
- Change ems_metrics_processor_worker from 400 to 600 MB.
- Change priority_worker from the inherited queue worker value of 400 to a customized 600 MB.

That said, while bumping up the memory thresholds will probably help with the issue at hand, if a specific job at this site is misbehaving, there currently isn't enough information to point us to it. Logs, plus a general idea of the scale of the environment in question (number of VMs, what type of providers they are using, etc.), would help tremendously to narrow down the scope of this issue and which jobs specifically are causing the problems. Without that, I don't have much else to recommend besides trying higher memory thresholds for the workers (as Joe also suggested).
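For reference, a minimal sketch of what those changes might look like in the appliance's advanced settings, assuming the usual CFME/ManageIQ worker settings layout. The key nesting and the current inherited defaults are assumptions here and should be verified against the appliance's own Advanced settings page before changing anything.

:workers:
  :worker_base:
    :defaults:
      # was 200 MB; ems_refresh_core_worker inherits this default
      :memory_threshold: 400.megabytes
    :queue_worker_base:
      :defaults:
        # was 400 MB; generic_worker inherits this value
        :memory_threshold: 500.megabytes
      :ems_metrics_processor_worker:
        # was 400 MB
        :memory_threshold: 600.megabytes
      :priority_worker:
        # was the inherited 400 MB queue worker value
        :memory_threshold: 600.megabytes

After applying the changes and restarting evmserverd, watching evm.log for further evm_worker_memory_exceeded events should show whether the new thresholds hold.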
Closing the ticket since the reported problem is no longer reproducible at the originator's site.