Description of problem:
Following a recent upgrade from 4.0 to 4.1, I am now seeing a large number of evm_worker_memory_exceeded log lines, workers being stopped, and heavy swap usage. I am told that a similar issue arose when this appliance was upgraded from 3.2 to 4.0, but that issue (1322485) is closed and its fix has been merged into 4.1.

Version-Release number of selected component (if applicable):
cfme-5.6.1.2

How reproducible:
Very reproducible in this environment; not reproducible in others.

Actual results:
[----] I, [2016-09-20T13:01:09.686855 #27855:10df998]  INFO -- : Followed Relationship [miqaedb:/System/Event/MiqEvent/POLICY/evm_worker_memory_exceeded#create]

Expected results:
Normal operation.

Additional info:
(logs pending)
Looks like something for the Performance team. Changing component.
Do you have a copy of the log files, or anything else we can work with? This is a little tricky to debug from the provided information. Thanks.
Nick, this sounds related: https://bugzilla.redhat.com/show_bug.cgi?id=1391687 (memory thresholds for specific workers need to be increased since they weren't bumped when we moved to the generational GC of ruby 2.1+). Note, the swap behavior still sounds like it could be a different problem: if we keep recycling workers, it's possible we start new ones before the old ones get killed and we end up swapping. Either way, we should try to get logs and see if that's what is happening.
Joe makes a good point, and the first step should be to see whether the following memory threshold changes help (see the settings sketch after this list):
- Change the default worker threshold from 200 to 400 MB. ems_refresh_core_worker inherits this 400 MB default.
- Change the default queue worker threshold from 400 to 500 MB. generic_worker inherits this 500 MB value.
- Change ems_metrics_processor_worker from 400 to 600 MB.
- Change priority_worker from the inherited queue worker value of 400 to a customized 600 MB.

That said, while bumping up the memory thresholds will probably help with the issue at hand, if a specific job at this site is misbehaving, there currently isn't enough information to point us to it. Logs, plus a general idea of the scale of the environment in question (number of VMs, what type of providers they are using, etc.), would help tremendously to narrow down the scope of this issue and which jobs specifically are causing the problems. Without that, I don't have much else to recommend besides trying higher memory thresholds for the workers (as Joe also suggested).
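For reference, a minimal sketch of what those changes might look like in the appliance's advanced settings, assuming the usual CFME/ManageIQ worker settings layout. The key nesting and the current inherited defaults are assumptions here and should be verified against the appliance's own Advanced settings page before changing anything.

:workers:
  :worker_base:
    :defaults:
      # was 200 MB; ems_refresh_core_worker inherits this default
      :memory_threshold: 400.megabytes
    :queue_worker_base:
      :defaults:
        # was 400 MB; generic_worker inherits this value
        :memory_threshold: 500.megabytes
      :ems_metrics_processor_worker:
        # was 400 MB
        :memory_threshold: 600.megabytes
      :priority_worker:
        # was the inherited 400 MB queue worker value
        :memory_threshold: 600.megabytes

After applying the changes and restarting evmserverd, watching evm.log for further evm_worker_memory_exceeded events should show whether the new thresholds hold.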
Closing the ticket since the reported problem is no longer reproducible at the originator's site.