Bug 1533484 - Miq Server leaks memory and we fail to detect and remediate it
Summary: Miq Server leaks memory and we fail to detect and remediate it
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.8.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: GA
: 5.10.0
Assignee: Joe Rafaniello
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks: 1537131 1537132
TreeView+ depends on / blocked
 
Reported: 2018-01-11 13:29 UTC by dmetzger
Modified: 2019-08-22 00:54 UTC (History)
5 users (show)

Fixed In Version: 5.10.0.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1537131 1537132 (view as bug list)
Environment:
Last Closed: 2018-06-21 20:26:52 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description dmetzger 2018-01-11 13:29:28 UTC
Description of problem:

The Miq Server leaks memory over time. The leak rate is dependent on the workload, it has been seen to be as high as 500Mb per day.

Leak has been observed on 5.7 and 5.9 versions of the server.

Version-Release number of selected component (if applicable):


How reproducible:

Always, the rate of leak is variable.

Steps to Reproduce:
1. Boot appliance
2. Monitor size (RSS/USS/PSS) size of the Miq Server process
3.

Actual results:

Server experiences unlimited memory size increases over time

Server process reaches a steady state memory size and remains at that size.

Expected results:


Additional info:

Comment 3 CFME Bot 2018-01-12 14:01:59 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/529123f213109ddbe95bb800549b692761d54dea

commit 529123f213109ddbe95bb800549b692761d54dea
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Wed Jan 10 14:54:53 2018 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Jan 11 17:16:14 2018 -0500

    Server shuts down if memory usage exceeds a limit
    
    Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1533484
    
    Previously, the server could slowly grow in memory usage and nothing was
    monitoring this.  Now, we use the memory usage we already were
    collecting and react to it by shutting down the server process if it
    exceeds the configurable threshold.  We rely upon systemd, or kubernetes
    in the future, to see our non-zero exit code and automatically start a
    new server process within a few minutes.
    
    Note, we are trying to be "graceful" in that we call shutdown_and_exit
    with a non-zero exit code.  This will deactivate the roles on the
    server and attempt to gracefully shutdown the workers.
    
    Additionally, we create a notification so an administrator can look
    into it and report any future leaks.

 app/models/miq_server.rb           | 21 +++++++++++++++++++++
 config/settings.yml                |  1 +
 db/fixtures/notification_types.yml |  5 +++++
 spec/models/miq_server_spec.rb     | 24 ++++++++++++++++++++++++
 4 files changed, 51 insertions(+)

Comment 4 Joe Rafaniello 2018-01-12 16:59:12 UTC
Steps for Testing/QA [Optional]
-------------------------------

Change the memory_threshold in the `:server` section in advanced settings

From:
`:memory_threshold: 2.gigabytes`

To: a number a bit larger than the current process.

Wait for the server to grow and exceed that limit.

You'll see a notification in the UI after the server restarts:

![image](https://user-images.githubusercontent.com/19339/34850556-5280c77c-f6f4-11e7-9cb4-b8b3ec84ce6e.png)


You'll see this in the log:

Memory exceeding limit and shutdown started:
```
[----] W, [2018-01-11T17:22:08.078414 #25874:10d1140]  WARN -- : MIQ(MiqServer#monitor_myself) server(pid: 25874, name: EVM) memory usage [357347328] exceeded limit: [346030080].  Exiting server process.
[----] I, [2018-01-11T17:22:08.142178 #25874:10d1140]  INFO -- : MIQ(MiqServer#shutdown) initiated for MiqServer [EVM] with ID: [1], PID: [25874], GUID: [6d28a0a7-3b2b-4da4-9306-6799c74a4295]
[----] I, [2018-01-11T17:22:08.166732 #25874:10d1140]  INFO -- : MIQ(MiqQueue.put) Message id: [10839],  id: [], Zone: [default], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"MiqServer", :object_id=>1, :attrs=>{:event_type=>"evm_server_stop", "MiqEvent::miq_event"=>612, :miq_event_id=>612, "EventStream::event_stream"=>612, :event_stream_id=>612}, :instance_name=>"Event", :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :automate_message=>nil}]
```

Roles are deactivated:
```
[----] I, [2018-01-11T17:22:08.175980 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <ems_inventory> on Server <EVM>
[----] I, [2018-01-11T17:22:08.178987 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <ems_operations> on Server <EVM>
[----] I, [2018-01-11T17:22:08.181984 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <event> on Server <EVM>
[----] I, [2018-01-11T17:22:08.185387 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <smartstate> on Server <EVM>
[----] I, [2018-01-11T17:22:08.188961 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <automate> on Server <EVM>
[----] I, [2018-01-11T17:22:08.193324 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <database_operations> on Server <EVM>
[----] I, [2018-01-11T17:22:08.197099 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <reporting> on Server <EVM>
[----] I, [2018-01-11T17:22:08.200250 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <scheduler> on Server <EVM>
[----] I, [2018-01-11T17:22:08.203034 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <user_interface> on Server <EVM>
[----] I, [2018-01-11T17:22:08.205993 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <websocket> on Server <EVM>
[----] I, [2018-01-11T17:22:08.209300 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <web_services> on Server <EVM>
```

Workers are requested to stop:
```
[----] I, [2018-01-11T17:22:08.211198 #25874:10d1140]  INFO -- : MIQ(MiqServer#quiesce_workers_loop) Stopping all active workers
[----] I, [2018-01-11T17:22:08.214771 #25874:10d1140]  INFO -- : MIQ(MiqServer#stop_worker) Stopping Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], status [started]...
[----] I, [2018-01-11T17:22:08.218757 #25874:10d1140]  INFO -- : MIQ(MiqQueue.put) Message id: [10840],  id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["MiqServer", 1], "evm_worker_stop", {:event_details=>"Stopping Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], status [started]...", :type=>"MiqPriorityWorker"}]
[----] I, [2018-01-11T17:22:08.221947 #25874:10d1140]  INFO -- : MIQ(MiqServer#worker_set_message) Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937] is being requested to exit
...
```

Worker rows are removed:
```
[----] I, [2018-01-11T17:22:13.347425 #25874:10d1140]  INFO -- : MIQ(MiqServer#clean_worker_records) SQL Record for Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], Status: [stopped] is being deleted
```

MiqQueue rows worked on by those workers are cleared:
```
[----] I, [2018-01-11T17:22:23.566711 #25874:10d1140]  INFO -- : MIQ(MiqServer#quiesce_all_workers) Cleaning all active messages being processed by MiqServer
```

Startup of replacement server is done automatically by systemd:
```
[----] I, [2018-01-11T17:22:39.069802 #26467:9bd138]  INFO -- : MIQ(EvmApplication.start) EVM Startup initiated
```

Comment 5 Joe Rafaniello 2018-01-18 20:45:08 UTC
Note, the server memory leak is being tracked in bug 1535720


Note You need to log in before you can comment on or make changes to this bug.