Description of problem: The Miq Server leaks memory over time. The leak rate is dependent on the workload, it has been seen to be as high as 500Mb per day. Leak has been observed on 5.7 and 5.9 versions of the server. Version-Release number of selected component (if applicable): How reproducible: Always, the rate of leak is variable. Steps to Reproduce: 1. Boot appliance 2. Monitor size (RSS/USS/PSS) size of the Miq Server process 3. Actual results: Server experiences unlimited memory size increases over time Server process reaches a steady state memory size and remains at that size. Expected results: Additional info:
https://github.com/ManageIQ/manageiq/pull/16807
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/529123f213109ddbe95bb800549b692761d54dea commit 529123f213109ddbe95bb800549b692761d54dea Author: Joe Rafaniello <jrafanie> AuthorDate: Wed Jan 10 14:54:53 2018 -0500 Commit: Joe Rafaniello <jrafanie> CommitDate: Thu Jan 11 17:16:14 2018 -0500 Server shuts down if memory usage exceeds a limit Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1533484 Previously, the server could slowly grow in memory usage and nothing was monitoring this. Now, we use the memory usage we already were collecting and react to it by shutting down the server process if it exceeds the configurable threshold. We rely upon systemd, or kubernetes in the future, to see our non-zero exit code and automatically start a new server process within a few minutes. Note, we are trying to be "graceful" in that we call shutdown_and_exit with a non-zero exit code. This will deactivate the roles on the server and attempt to gracefully shutdown the workers. Additionally, we create a notification so an administrator can look into it and report any future leaks. app/models/miq_server.rb | 21 +++++++++++++++++++++ config/settings.yml | 1 + db/fixtures/notification_types.yml | 5 +++++ spec/models/miq_server_spec.rb | 24 ++++++++++++++++++++++++ 4 files changed, 51 insertions(+)
Steps for Testing/QA [Optional] ------------------------------- Change the memory_threshold in the `:server` section in advanced settings From: `:memory_threshold: 2.gigabytes` To: a number a bit larger than the current process. Wait for the server to grow and exceed that limit. You'll see a notification in the UI after the server restarts:  You'll see this in the log: Memory exceeding limit and shutdown started: ``` [----] W, [2018-01-11T17:22:08.078414 #25874:10d1140] WARN -- : MIQ(MiqServer#monitor_myself) server(pid: 25874, name: EVM) memory usage [357347328] exceeded limit: [346030080]. Exiting server process. [----] I, [2018-01-11T17:22:08.142178 #25874:10d1140] INFO -- : MIQ(MiqServer#shutdown) initiated for MiqServer [EVM] with ID: [1], PID: [25874], GUID: [6d28a0a7-3b2b-4da4-9306-6799c74a4295] [----] I, [2018-01-11T17:22:08.166732 #25874:10d1140] INFO -- : MIQ(MiqQueue.put) Message id: [10839], id: [], Zone: [default], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"MiqServer", :object_id=>1, :attrs=>{:event_type=>"evm_server_stop", "MiqEvent::miq_event"=>612, :miq_event_id=>612, "EventStream::event_stream"=>612, :event_stream_id=>612}, :instance_name=>"Event", :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :automate_message=>nil}] ``` Roles are deactivated: ``` [----] I, [2018-01-11T17:22:08.175980 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <ems_inventory> on Server <EVM> [----] I, [2018-01-11T17:22:08.178987 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <ems_operations> on Server <EVM> [----] I, [2018-01-11T17:22:08.181984 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <event> on Server <EVM> [----] I, [2018-01-11T17:22:08.185387 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <smartstate> on Server <EVM> [----] I, [2018-01-11T17:22:08.188961 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <automate> on Server <EVM> [----] I, [2018-01-11T17:22:08.193324 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <database_operations> on Server <EVM> [----] I, [2018-01-11T17:22:08.197099 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <reporting> on Server <EVM> [----] I, [2018-01-11T17:22:08.200250 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <scheduler> on Server <EVM> [----] I, [2018-01-11T17:22:08.203034 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <user_interface> on Server <EVM> [----] I, [2018-01-11T17:22:08.205993 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <websocket> on Server <EVM> [----] I, [2018-01-11T17:22:08.209300 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <web_services> on Server <EVM> ``` Workers are requested to stop: ``` [----] I, [2018-01-11T17:22:08.211198 #25874:10d1140] INFO -- : MIQ(MiqServer#quiesce_workers_loop) Stopping all active workers [----] I, [2018-01-11T17:22:08.214771 #25874:10d1140] INFO -- : MIQ(MiqServer#stop_worker) Stopping Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], status [started]... [----] I, [2018-01-11T17:22:08.218757 #25874:10d1140] INFO -- : MIQ(MiqQueue.put) Message id: [10840], id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["MiqServer", 1], "evm_worker_stop", {:event_details=>"Stopping Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], status [started]...", :type=>"MiqPriorityWorker"}] [----] I, [2018-01-11T17:22:08.221947 #25874:10d1140] INFO -- : MIQ(MiqServer#worker_set_message) Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937] is being requested to exit ... ``` Worker rows are removed: ``` [----] I, [2018-01-11T17:22:13.347425 #25874:10d1140] INFO -- : MIQ(MiqServer#clean_worker_records) SQL Record for Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], Status: [stopped] is being deleted ``` MiqQueue rows worked on by those workers are cleared: ``` [----] I, [2018-01-11T17:22:23.566711 #25874:10d1140] INFO -- : MIQ(MiqServer#quiesce_all_workers) Cleaning all active messages being processed by MiqServer ``` Startup of replacement server is done automatically by systemd: ``` [----] I, [2018-01-11T17:22:39.069802 #26467:9bd138] INFO -- : MIQ(EvmApplication.start) EVM Startup initiated ```
Note, the server memory leak is being tracked in bug 1535720