Bug 1533484
Summary: | Miq Server leaks memory and we fail to detect and remediate it | |||
---|---|---|---|---|
Product: | Red Hat CloudForms Management Engine | Reporter: | dmetzger | |
Component: | Appliance | Assignee: | Joe Rafaniello <jrafanie> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Dave Johnson <dajohnso> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 5.8.0 | CC: | abellott, cpelland, jrafanie, obarenbo, simaishi | |
Target Milestone: | GA | Keywords: | TestOnly | |
Target Release: | 5.10.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | 5.10.0.0 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1537131 1537132 (view as bug list) | Environment: | ||
Last Closed: | 2018-06-21 20:26:52 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | CFME Core | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1537131, 1537132 |
Description
dmetzger
2018-01-11 13:29:28 UTC
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/529123f213109ddbe95bb800549b692761d54dea commit 529123f213109ddbe95bb800549b692761d54dea Author: Joe Rafaniello <jrafanie> AuthorDate: Wed Jan 10 14:54:53 2018 -0500 Commit: Joe Rafaniello <jrafanie> CommitDate: Thu Jan 11 17:16:14 2018 -0500 Server shuts down if memory usage exceeds a limit Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1533484 Previously, the server could slowly grow in memory usage and nothing was monitoring this. Now, we use the memory usage we already were collecting and react to it by shutting down the server process if it exceeds the configurable threshold. We rely upon systemd, or kubernetes in the future, to see our non-zero exit code and automatically start a new server process within a few minutes. Note, we are trying to be "graceful" in that we call shutdown_and_exit with a non-zero exit code. This will deactivate the roles on the server and attempt to gracefully shutdown the workers. Additionally, we create a notification so an administrator can look into it and report any future leaks. app/models/miq_server.rb | 21 +++++++++++++++++++++ config/settings.yml | 1 + db/fixtures/notification_types.yml | 5 +++++ spec/models/miq_server_spec.rb | 24 ++++++++++++++++++++++++ 4 files changed, 51 insertions(+) Steps for Testing/QA [Optional] ------------------------------- Change the memory_threshold in the `:server` section in advanced settings From: `:memory_threshold: 2.gigabytes` To: a number a bit larger than the current process. Wait for the server to grow and exceed that limit. You'll see a notification in the UI after the server restarts:  You'll see this in the log: Memory exceeding limit and shutdown started: ``` [----] W, [2018-01-11T17:22:08.078414 #25874:10d1140] WARN -- : MIQ(MiqServer#monitor_myself) server(pid: 25874, name: EVM) memory usage [357347328] exceeded limit: [346030080]. Exiting server process. [----] I, [2018-01-11T17:22:08.142178 #25874:10d1140] INFO -- : MIQ(MiqServer#shutdown) initiated for MiqServer [EVM] with ID: [1], PID: [25874], GUID: [6d28a0a7-3b2b-4da4-9306-6799c74a4295] [----] I, [2018-01-11T17:22:08.166732 #25874:10d1140] INFO -- : MIQ(MiqQueue.put) Message id: [10839], id: [], Zone: [default], Role: [automate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"MiqServer", :object_id=>1, :attrs=>{:event_type=>"evm_server_stop", "MiqEvent::miq_event"=>612, :miq_event_id=>612, "EventStream::event_stream"=>612, :event_stream_id=>612}, :instance_name=>"Event", :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :automate_message=>nil}] ``` Roles are deactivated: ``` [----] I, [2018-01-11T17:22:08.175980 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <ems_inventory> on Server <EVM> [----] I, [2018-01-11T17:22:08.178987 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <ems_operations> on Server <EVM> [----] I, [2018-01-11T17:22:08.181984 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <event> on Server <EVM> [----] I, [2018-01-11T17:22:08.185387 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <smartstate> on Server <EVM> [----] I, [2018-01-11T17:22:08.188961 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <automate> on Server <EVM> [----] I, [2018-01-11T17:22:08.193324 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <database_operations> on Server <EVM> [----] I, [2018-01-11T17:22:08.197099 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <reporting> on Server <EVM> [----] I, [2018-01-11T17:22:08.200250 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <scheduler> on Server <EVM> [----] I, [2018-01-11T17:22:08.203034 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <user_interface> on Server <EVM> [----] I, [2018-01-11T17:22:08.205993 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <websocket> on Server <EVM> [----] I, [2018-01-11T17:22:08.209300 #25874:10d1140] INFO -- : MIQ(AssignedServerRole#deactivate) Deactivating Role <web_services> on Server <EVM> ``` Workers are requested to stop: ``` [----] I, [2018-01-11T17:22:08.211198 #25874:10d1140] INFO -- : MIQ(MiqServer#quiesce_workers_loop) Stopping all active workers [----] I, [2018-01-11T17:22:08.214771 #25874:10d1140] INFO -- : MIQ(MiqServer#stop_worker) Stopping Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], status [started]... [----] I, [2018-01-11T17:22:08.218757 #25874:10d1140] INFO -- : MIQ(MiqQueue.put) Message id: [10840], id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["MiqServer", 1], "evm_worker_stop", {:event_details=>"Stopping Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], status [started]...", :type=>"MiqPriorityWorker"}] [----] I, [2018-01-11T17:22:08.221947 #25874:10d1140] INFO -- : MIQ(MiqServer#worker_set_message) Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937] is being requested to exit ... ``` Worker rows are removed: ``` [----] I, [2018-01-11T17:22:13.347425 #25874:10d1140] INFO -- : MIQ(MiqServer#clean_worker_records) SQL Record for Worker [MiqPriorityWorker] with ID: [388], PID: [26101], GUID: [b295b366-6fde-4955-8354-440418099937], Status: [stopped] is being deleted ``` MiqQueue rows worked on by those workers are cleared: ``` [----] I, [2018-01-11T17:22:23.566711 #25874:10d1140] INFO -- : MIQ(MiqServer#quiesce_all_workers) Cleaning all active messages being processed by MiqServer ``` Startup of replacement server is done automatically by systemd: ``` [----] I, [2018-01-11T17:22:39.069802 #26467:9bd138] INFO -- : MIQ(EvmApplication.start) EVM Startup initiated ``` Note, the server memory leak is being tracked in bug 1535720 |