Bug 1377866

Summary: post 4.0-->4.1 upgrade: evm_worker_memory_exceeded and workers stopped
Product: Red Hat CloudForms Management Engine Reporter: Colin Arnott <carnott>
Component: Performance Assignee: Nick LaMuro <nlamuro>
Status: CLOSED CURRENTRELEASE QA Contact: luke couzens <lcouzens>
Severity: high Docs Contact:
Priority: high    
Version: unspecified CC: abellott, benglish, dajohnso, dmetzger, gekis, jhardy, jrafanie, kbrock, ncarboni, obarenbo, simaishi
Target Milestone: GA   
Target Release: cfme-future   
Hardware: x86_64   
OS: Linux   
Whiteboard: perf:upgrade:worker
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-01 13:11:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Colin Arnott 2016-09-20 20:32:24 UTC
Description of problem:
Following a recent upgrade from 4.0 to 4.1, I am now seeing a large number of evm_worker_memory_exceeded log lines, workers being stopped, and heavy swap usage.

I am told that a similar issue arose when this appliance was upgraded from 3.2 to 4.0, but that issue (1322485) is closed and its fix was merged into 4.1.


Version-Release number of selected component (if applicable):
cfme-5.6.1.2

How reproducible:
 on this environment: very
 other environments: not reproducible

Actual results:
[----] I, [2016-09-20T13:01:09.686855 #27855:10df998]  INFO -- : Followed  Relationship [miqaedb:/System/Event/MiqEvent/POLICY/evm_worker_memory_exceeded#create]


Expected results:
normal operation

Additional info:
(logs pending)

Comment 3 Nick Carboni 2016-09-21 14:16:22 UTC
Looks like something for the Performance team.

Changing component.

Comment 8 Keenan Brock 2016-11-07 18:40:18 UTC
Do you have a copy of the log files?

Or something that we can work with?
This is a little tricky to debug from the provided information.

thanks

Comment 10 Joe Rafaniello 2016-11-08 19:57:06 UTC
Nick, this sounds related:  https://bugzilla.redhat.com/show_bug.cgi?id=1391687 (memory thresholds for specific workers need to be increased since they weren't bumped when we moved to the generational GC of ruby 2.1+).

Note, the swap invasion still sounds like possibly a different problem.  If we keep recycling workers, it's possible we start new ones before the old ones get killed and we end up swapping.  Either way, we should try to get logs and see if that's what is happening.
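
The overlap described above can be illustrated with a small Ruby sketch. This is purely illustrative and not CFME's actual monitor code; rss_bytes, run_worker, and recycle_if_exceeded are hypothetical names, and the threshold value is just an example.

    MEMORY_THRESHOLD = 400 * 1024 * 1024 # example per-worker threshold, in bytes

    def rss_bytes(pid)
      # Linux-specific: resident page count from /proc, times the 4 KiB page size
      File.read("/proc/#{pid}/statm").split[1].to_i * 4096
    end

    def run_worker
      sleep # stand-in for a real worker loop (hypothetical)
    end

    def recycle_if_exceeded(worker_pids)
      worker_pids.map do |pid|
        next pid if rss_bytes(pid) <= MEMORY_THRESHOLD

        replacement = fork { run_worker } # the replacement starts right away...
        Process.kill("TERM", pid)         # ...while the old worker is only asked to exit
        replacement                       # until it does, old and new are resident together
      end
    end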

Comment 11 Nick LaMuro 2016-11-14 20:46:24 UTC
Joe makes a good point; we should see whether the following changes help:

    Change the default worker memory threshold from 200 MB to 400 MB.
    ems_refresh_core_worker inherits this 400 MB default.

    Change the default queue worker threshold from 400 MB to 500 MB.
    generic_worker inherits this 500 MB value.

    Change ems_metrics_processor_worker from 400 MB to 600 MB.
    Change priority_worker from the old inherited queue worker value of
    400 MB to a customized 600 MB.


That said, while bumping up the memory thresholds will probably help with the issue at hand, if a specific job at this client is acting up, there currently isn't enough info to point us to what that is.  Having logs to look at, and a general idea of the scale of the client in question (number of VMs, what types of providers they are using, etc.), would help tremendously in narrowing down the scope of this issue and identifying which jobs specifically are causing the problems.

Without that, I don't have much else to recommend beyond trying to increase the workers' memory thresholds (as Joe also suggested).
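
For reference, the threshold changes listed above can be expressed as a settings-style Ruby hash. This is a minimal sketch only; the nesting and key names are assumed from ManageIQ's settings.yml layout and may not match this appliance exactly, but the values mirror the bumps described in this comment.

    # Sketch of the proposed per-worker memory_threshold overrides.
    # Key names are assumptions based on ManageIQ's settings layout;
    # the MB values come from the list above.
    require "active_support/core_ext/numeric/bytes" # provides Integer#megabytes

    WORKER_MEMORY_THRESHOLDS = {
      :worker_base => {
        :defaults          => {:memory_threshold => 400.megabytes}, # was 200 MB; ems_refresh_core_worker inherits this
        :queue_worker_base => {
          :defaults                     => {:memory_threshold => 500.megabytes}, # was 400 MB; generic_worker inherits this
          :ems_metrics_processor_worker => {:memory_threshold => 600.megabytes}, # was 400 MB
          :priority_worker              => {:memory_threshold => 600.megabytes}, # previously inherited 400 MB
        },
      },
    }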

Comment 15 dmetzger 2016-12-01 13:11:28 UTC
Closing the ticket since the reported problem is no longer reproducible at the originator's site.