| Summary: | post 4.0-->4.1 upgrade: evm_worker_memory_exceeded and workers stopped | | |
|---|---|---|---|
| Product: | Red Hat CloudForms Management Engine | Reporter: | Colin Arnott <carnott> |
| Component: | Performance | Assignee: | Nick LaMuro <nlamuro> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | luke couzens <lcouzens> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | unspecified | CC: | abellott, benglish, dajohnso, dmetzger, gekis, jhardy, jrafanie, kbrock, ncarboni, obarenbo, simaishi |
| Target Milestone: | GA | | |
| Target Release: | cfme-future | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | perf:upgrade:worker | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-01 13:11:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Colin Arnott
2016-09-20 20:32:24 UTC
Looks like something for the Performance team. Changing component.

Do you have a copy of the log files, or something else we can work with? This is a little tricky to debug from the information provided.

Thanks Nick, this sounds related: https://bugzilla.redhat.com/show_bug.cgi?id=1391687 (memory thresholds for specific workers need to be increased, since they weren't bumped when we moved to the generational GC of Ruby 2.1+). Note, the heavy swap usage still sounds like it could be a different problem: if we keep recycling workers, it's possible we start new ones before the old ones get killed, and we end up swapping. Either way, we should try to get logs and see if that is what is happening.

Joe makes a good point; we should see whether the following changes help (sketched as a settings snippet after the list):
- Change the default worker threshold from 200 to 400 MB (ems_refresh_core_worker inherits this 400 MB default).
- Change the default queue worker threshold from 400 to 500 MB (generic_worker inherits this 500 MB value).
- Change ems_metrics_processor_worker from 400 to 600 MB.
- Change priority_worker from the old inherited queue worker value of 400 MB to a customized 600 MB.
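
For reference, a minimal sketch of how those thresholds might look in the appliance's advanced settings. The key hierarchy below is an approximation of the ManageIQ/CFME worker settings tree; the exact nesting, key names, and location (UI "Advanced" tab vs. settings file) vary by release, so verify against the settings shipped on the appliance before changing anything.

```yaml
# Hedged sketch of the proposed worker memory thresholds; key nesting is
# approximate and should be confirmed against the appliance's own settings.
:workers:
  :worker_base:
    :defaults:
      :memory_threshold: 400.megabytes      # was 200 MB; ems_refresh_core_worker inherits this
    :queue_worker_base:
      :defaults:
        :memory_threshold: 500.megabytes    # was 400 MB; generic_worker inherits this
      :ems_metrics_processor_worker:
        :memory_threshold: 600.megabytes    # was 400 MB
      :priority_worker:
        :memory_threshold: 600.megabytes    # was the inherited 400 MB, now set explicitly
```

Worker processes typically need to be restarted (or evmserverd cycled) before new thresholds take effect.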
That said, while bumping up the memory thresholds will probably help with the issue at hand, if a specific job at this client is acting up, there currently isn't enough information to point us to what it is. Having logs to look at, and a general idea of the client's scale (number of VMs, which provider types are in use, etc.), would help tremendously in narrowing down the scope of this issue and identifying which jobs are causing the problems.
Without that, I don't have much else to recommend beyond trying to raise the worker memory values (as Joe also suggested).
Closing the ticket since the reported problem is no longer reproducible at the originator's site.