1535177 – [RFE][M-5] "Out of memory worker exceeded" verbosity for end user

Bug 1535177 - [RFE][M-5] "Out of memory worker exceeded" verbosity for end user

Summary: [RFE][M-5] "Out of memory worker exceeded" verbosity for end user

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.9.0
Hardware:	All
OS:	All
Priority:	high
Severity:	high
Target Milestone:	MVP
Target Release:	5.10.0
Assignee:	Keenan Brock
QA Contact:	Tasos Papaioannou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1555371
TreeView+	depends on / blocked

Reported:	2018-01-16 19:12 UTC by Josh Carter
Modified:	2019-02-07 23:01 UTC (History)
CC List:	12 users (show)
Fixed In Version:	5.10.0.18
Doc Type:	Enhancement
Doc Text:	Alert emails now include the text displayed in a notification.
Clone Of:
Environment:
Last Closed:	2019-02-07 23:00:57 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2019:0212	0	None	None	None	2019-02-07 23:01:10 UTC

Comment 6 Keenan Brock 2018-06-18 19:16:06 UTC

We are generating an MiqEvent "evm_worker_memory_exceeded"
It looks like it is working but am diving deeper to see if a bug is preventing this from surfacing

Comment 7 CFME Bot 2018-06-26 17:56:43 UTC

https://github.com/ManageIQ/manageiq/pull/17640

Comment 8 Keenan Brock 2018-06-27 16:32:30 UTC

dmetzger57:

circling back before closing

> Defining a new synthetic event which is triggered when a worker has been identified as exceeding its associated memory threshold. 

That sounds like we want to raise an event if the worker goes over memory threshold.
We currently raise an event and apply a policy.
I had read this as adding a ui notification. (now I'm not so sure)

> Should de-duplication be implemented so that at most 1 event is generated when a worker exceeds the threshold

Don't think we do this anywhere.

> It is possible for a single [worker] to be identified as exceeding the threshold if it does not complete termination

This sounds like doubling down on the de-duplication statement. Serious concern that that there will be way too many over memory exceptions.

? Is this a request, or an optimization to reduce the number of events generated?

Comment 9 Greg Blomquist 2018-07-02 17:57:49 UTC

Dennis, see Keenan's comment #8.

Comment 10 dmetzger 2018-07-03 13:59:53 UTC

It was a request to ensure we don't spam the user with multiple memory exceeded events for the same worker once it has exceeded the threshold but does not terminate for a long time - in which case it could be found multiple times in the "exceeded" state. From out of band conversations, it sounds like the code already handles this in the check logic.

Comment 11 CFME Bot 2018-07-06 00:21:13 UTC

https://github.com/ManageIQ/manageiq/pull/17673

Comment 12 CFME Bot 2018-08-13 18:06:13 UTC

New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/ebd613a5a6deecd0abf0e47b6bf8a620e516a749
commit ebd613a5a6deecd0abf0e47b6bf8a620e516a749
Author:     Keenan Brock <keenan>
AuthorDate: Thu Jun 21 17:04:31 2018 -0400
Commit:     Keenan Brock <keenan>
CommitDate: Thu Jun 21 17:04:31 2018 -0400

    notify users of killing workers when exceed memory

    - Provide more complete information for worker out of memory errors
    - Provide notification messages for out of memory errors
    - Introduce ability to disable notification
    - Include notification.message in alert emails

    https://bugzilla.redhat.com/show_bug.cgi?id=1535177

 app/models/miq_action.rb | 1 +
 app/models/miq_event.rb | 7 +-
 app/models/miq_server/worker_management/monitor/validation.rb | 10 +-
 app/models/notification.rb | 11 +-
 app/models/notification_type.rb | 10 +
 db/fixtures/notification_types.yml | 5 +
 spec/models/miq_action_spec.rb | 46 +
 spec/models/miq_event_spec.rb | 22 +-
 spec/models/notification_spec.rb | 21 +
 9 files changed, 121 insertions(+), 12 deletions(-)

Comment 13 Keenan Brock 2018-08-13 18:16:30 UTC

Merged into master

Notifications have a template to display information for some EventsStream events.
This notification event is now included in alert policy emails.

More information can be found in https://github.com/ManageIQ/manageiq/pull/17673

Comment 14 Tasos Papaioannou 2018-09-05 18:49:18 UTC

Tested on 5.10.0.14. The alert emails do not include the additional information mentioned in https://github.com/ManageIQ/manageiq/pull/17673 :

****
Before

Alert Triggered
Alert 'aKB Worker killed', triggered
Event: 	Alert condition met
Entity: 	(MiqServer) local

After

Alert Triggered
Alert 'aKB Worker killed', triggered
Event: 	Alert condition met
Entity: 	(MiqServer) local
Details: 	Killing worker MiqPriorityWorker due to excessive memory usage. 144.6 MB used memory exceeds limit of 100 MB.
****

My alert emails still look like:

****
Alert Triggered
Alert 'tpa', triggered
Event: 	Alert condition met
Entity: 	(MiqServer) EVM
****

I'm guessing this is because the evm_worker_memory_exceeded notification type added to db/fixtures/notification_types.yml in this commit did not actually get added to the database:

vmdb_production=# select * from notification_types where name like 'evm_worker_memory_exceeded';
 id | name | level | audience | message | expires_in 
----+------+-------+----------+---------+------------
(0 rows)

Comment 15 Keenan Brock 2018-09-10 14:26:07 UTC

Tasos,
I was expecting those changes to get seeded to the database.

Thanks for the very clear issue description - I'm looking into this now

Comment 16 Keenan Brock 2018-09-13 19:57:33 UTC

Tasos,

A new database, or a database that comes with a nightly build should have these new values.

If you have an existing database, the templates will not get automatically updated at server startup. We do auto update a few of the tables, but this is not one of them.

Is it possible to run the following script from a shell prompt and then restart/start the server?
This will not change any EMS or VM values that you already have in the database. It will only bring the default values up to date.

    vmdb # change to the manageiq directory
    rake db:seed # populate the default tables.

This will populate the database with the proper templates.

Thank you,

Keenan

Comment 17 Tasos Papaioannou 2018-09-17 20:05:17 UTC

Tested on the latest build 5.10.0.15 (9/14), and the notification type record is still missing from the db. Running 'rake db:seed' completes successfully but doesn't resolve the issue either.

Comment 18 Keenan Brock 2018-09-28 17:27:28 UTC

Sorry for the delay Tasos. Was spinning my wheels because I did not properly assign this to a server.

I found the issue and have put in a PR.


My reproduction steps:
(reproduction steps are so I don't spin my wheels again)

You probably want to just use the UI for this configuration instead of my hack

vmdb/config/settings/production.local.yml
#########

:smtp:
  :authentication: 'plain'
  :domain: mydomain.com
  :from: cfadmin
  :host: 127.0.0.1
  :port: '1025'

:workers:
  :worker_base:
    :queue_worker_base:
      :priority_worker:
        # this will force priority worker to keep restarting
        :memory_threshold: 100.megabytes
        :count: 1

######

In the UI:

Administrator Menu: Configuration (Upper Right)
Tree view: Settings > Zones > Zone: Default > Server: EVM[1] Current (this is default for me)

Server Control
  Server Roles
    Notifier: enable
    (Provider Inventory: disable - if you want to cut down on server work)

Left Menu: Control > Explorer
Tree view: Alert
Menu: Configuration > Create Alert

name: aaRestartWorker
enabled: true
based upon: Server
evaluate: Nothing
Driving Event: Appliance Operation: Worker Exceeded Memory
Notification Frequency: 1 minute
Send Email: true
To: problem
  Use "Add (enter Manually)" to enter
  Ensure the name shows up in the To box
  You probably have a different value
save

Tree view: Alert Profile
Tree view: All Alert Profiles > Server Alert Profiles
Menu: Configuration > Add Alert Profile

Description: aaRestartWorkerProfile
Move aaRestart to right
save

Tree view: aaRestartWorkerProfile
Menu: Configuration > Edit Alert Assignments for this profile

Choose all servers (only 1 server "EVM" for me)
save

Comment 19 CFME Bot 2018-09-28 17:36:03 UTC

https://github.com/ManageIQ/manageiq/pull/18035

Comment 20 CFME Bot 2018-09-28 22:31:23 UTC

New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/acd0a192b30b09850595ccda6d20daa348384286
commit acd0a192b30b09850595ccda6d20daa348384286
Author:     Keenan Brock <keenan>
AuthorDate: Fri Sep 28 13:11:20 2018 -0400
Commit:     Keenan Brock <keenan>
CommitDate: Fri Sep 28 13:11:20 2018 -0400

    Fix Notifications None

    fixes ebd613a5

    The notification type "NONE" was introduced, but it was not part of
    basic validations

    https://bugzilla.redhat.com/show_bug.cgi?id=1535177
 app/models/notification_type.rb | 2 +-
 spec/models/notification_type_spec.rb | 8 +
 2 files changed, 9 insertions(+), 1 deletion(-)

Comment 21 CFME Bot 2018-10-01 16:57:24 UTC

New commit detected on ManageIQ/manageiq/hammer:

https://github.com/ManageIQ/manageiq/commit/33aa42d74dafb5cc00206c4af28dfef3ca079ace
commit 33aa42d74dafb5cc00206c4af28dfef3ca079ace
Author:     Brandon Dunne <brandondunne>
AuthorDate: Fri Sep 28 18:26:33 2018 -0400
Commit:     Brandon Dunne <brandondunne>
CommitDate: Fri Sep 28 18:26:33 2018 -0400

    Merge pull request #18035 from kbrock/monitor_notifier_fix

    Fix Notifications None

    (cherry picked from commit 8489078c17dcc236541e3c3cebdf20bb2e6291c1)

    https://bugzilla.redhat.com/show_bug.cgi?id=1535177

 app/models/notification_type.rb | 2 +-
 spec/models/notification_type_spec.rb | 8 +
 2 files changed, 9 insertions(+), 1 deletion(-)

Comment 22 Tasos Papaioannou 2018-10-04 15:09:29 UTC

Verified on 5.10.0.18.

Comment 23 errata-xmlrpc 2019-02-07 23:00:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0212

Note You need to log in before you can comment on or make changes to this bug.