Bug 1533484

Summary: Miq Server leaks memory and we fail to detect and remediate it
Product: Red Hat CloudForms Management Engine Reporter: dmetzger
Component: ApplianceAssignee: Joe Rafaniello <jrafanie>
Status: CLOSED CURRENTRELEASE QA Contact: Dave Johnson <dajohnso>
Severity: high Docs Contact:
Priority: high    
Version: 5.8.0CC: abellott, cpelland, jrafanie, obarenbo, simaishi
Target Milestone: GAKeywords: TestOnly
Target Release: 5.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 5.10.0.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1537131 1537132 (view as bug list) Environment:
Last Closed: 2018-06-21 20:26:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1537131, 1537132    

Description dmetzger 2018-01-11 13:29:28 UTC
Description of problem:

The Miq Server leaks memory over time. The leak rate is dependent on the workload, it has been seen to be as high as 500Mb per day.

Leak has been observed on 5.7 and 5.9 versions of the server.

Version-Release number of selected component (if applicable):


How reproducible:

Always, the rate of leak is variable.

Steps to Reproduce:
1. Boot appliance
2. Monitor size (RSS/USS/PSS) size of the Miq Server process
3.

Actual results:

Server experiences unlimited memory size increases over time

Server process reaches a steady state memory size and remains at that size.

Expected results:


Additional info:

Comment 3 CFME Bot 2018-01-12 14:01:59 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/529123f213109ddbe95bb800549b692761d54dea

commit 529123f213109ddbe95bb800549b692761d54dea
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Wed Jan 10 14:54:53 2018 -0500
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Thu Jan 11 17:16:14 2018 -0500

    Server shuts down if memory usage exceeds a limit
    
    Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1533484
    
    Previously, the server could slowly grow in memory usage and nothing was
    monitoring this.  Now, we use the memory usage we already were
    collecting and react to it by shutting down the server process if it
    exceeds the configurable threshold.  We rely upon systemd, or kubernetes
    in the future, to see our non-zero exit code and automatically start a
    new server process within a few minutes.
    
    Note, we are trying to be "graceful" in that we call shutdown_and_exit
    with a non-zero exit code.  This will deactivate the roles on the
    server and attempt to gracefully shutdown the workers.
    
    Additionally, we create a notification so an administrator can look
    into it and report any future leaks.

 app/models/miq_server.rb           | 21 +++++++++++++++++++++
 config/settings.yml                |  1 +
 db/fixtures/notification_types.yml |  5 +++++
 spec/models/miq_server_spec.rb     | 24 ++++++++++++++++++++++++
 4 files changed, 51 insertions(+)

Comment 4 Joe Rafaniello 2018-01-12 16:59:12 UTC
Steps for Testing/QA [Optional]
-------------------------------

Change the memory_threshold in the `:server` section in advanced settings

From:
`:memory_threshold: 2.gigabytes`

To: a number a bit larger than the current process.

Wait for the server to grow and exceed that limit.

You'll see a notification in the UI after the server restarts:

![image](https://user-images.githubusercontent.com/19339/34850556-5280c77c-f6f4-11e7-9cb4-b8b3ec84ce6e.png)


You'll see this in the log:

Memory exceeding limit and shutdown started:
```
[----] W, [2018-01-11T17:22:08.078414 #25874:10d1140]  WARN -- : MIQ(MiqServer#monitor_myself) server(pid: 25874, name: EVM) memory usage [357347328] exceeded limit: [346030080].  Exiting server process.
[----] I, [2018-01-11T17:22:08.142178 #25874:10d1140]  INFO -- : MIQ(MiqServer#shutdown) initiated for MiqServer [EVM] with ID: [1], PID: [25874], GUID: [6d28a0a7-3b2b-4da4-9306-6799c74a4295]
[----] I, [2018-01-11T17:22:08.166732 #25874:10d1140]  INFO -- : MIQ(MiqQueue.put) Message id: [10839],  id: [], Zone: [default], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"MiqServer", :object_id=>1, :attrs=>{:event_type=>"evm_server_stop", "MiqEvent::miq_event"=>612, :miq_event_id=>612, "EventStream::event_stream"=>612, :event_stream_id=>612}, :instance_name=>"Event", :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :automate_message=>nil}]
```

Roles are deactivated:
```
[----] I, [2018-01-11T17:22:08.175980 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <ems_inventory> on Server <EVM>
[----] I, [2018-01-11T17:22:08.178987 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <ems_operations> on Server <EVM>
[----] I, [2018-01-11T17:22:08.181984 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <event> on Server <EVM>
[----] I, [2018-01-11T17:22:08.185387 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <smartstate> on Server <EVM>
[----] I, [2018-01-11T17:22:08.188961 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <automate> on Server <EVM>
[----] I, [2018-01-11T17:22:08.193324 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <database_operations> on Server <EVM>
[----] I, [2018-01-11T17:22:08.197099 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <reporting> on Server <EVM>
[----] I, [2018-01-11T17:22:08.200250 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <scheduler> on Server <EVM>
[----] I, [2018-01-11T17:22:08.203034 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <user_interface> on Server <EVM>
[----] I, [2018-01-11T17:22:08.205993 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <websocket> on Server <EVM>
[----] I, [2018-01-11T17:22:08.209300 #25874:10d1140]  INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <web_services> on Server <EVM>
```

Workers are requested to stop:
```
[----] I, [2018-01-11T17:22:08.211198 #25874:10d1140]  INFO -- : MIQ(MiqServer#quiesce_workers_loop) Stopping all active workers
[----] I, [2018-01-11T17:22:08.214771 #25874:10d1140]  INFO -- : MIQ(MiqServer#stop_worker) Stopping Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], status [started]...
[----] I, [2018-01-11T17:22:08.218757 #25874:10d1140]  INFO -- : MIQ(MiqQueue.put) Message id: [10840],  id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["MiqServer", 1], "evm_worker_stop", {:event_details=>"Stopping Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], status [started]...", :type=>"MiqPriorityWorker"}]
[----] I, [2018-01-11T17:22:08.221947 #25874:10d1140]  INFO -- : MIQ(MiqServer#worker_set_message) Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937] is being requested to exit
...
```

Worker rows are removed:
```
[----] I, [2018-01-11T17:22:13.347425 #25874:10d1140]  INFO -- : MIQ(MiqServer#clean_worker_records) SQL Record for Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], Status: [stopped] is being deleted
```

MiqQueue rows worked on by those workers are cleared:
```
[----] I, [2018-01-11T17:22:23.566711 #25874:10d1140]  INFO -- : MIQ(MiqServer#quiesce_all_workers) Cleaning all active messages being processed by MiqServer
```

Startup of replacement server is done automatically by systemd:
```
[----] I, [2018-01-11T17:22:39.069802 #26467:9bd138]  INFO -- : MIQ(EvmApplication.start) EVM Startup initiated
```

Comment 5 Joe Rafaniello 2018-01-18 20:45:08 UTC
Note, the server memory leak is being tracked in bug 1535720