Description of problem:
Workers running a long-running task, such as generating a complicated report, are not given enough time to complete the task if they hit the memory threshold.

Version-Release number of selected component (if applicable):

How reproducible:
https://bugzilla.redhat.com/show_bug.cgi?id=1395736

Steps to Reproduce:
1. Set memory_threshold for the generic_worker to its "idle" PSS memory usage + 20 MB
2. Set stopping_timeout to a small value, such as 30 seconds
3. Decrease the generic worker count to 1 so it's the only process doing reports
4. Schedule a bunch of long-running reports

Actual results:
The worker gets killed after processing the miq_queue work item for "stopping_timeout" seconds. Typically, this happens on custom reports that include virtual columns that we can't sort/aggregate in SQL and instead do in Ruby.

Expected results:
If the miq_queue item has a msg_timeout of 60 minutes and the worker hits the memory threshold, we should honor the msg_timeout, since it's expected that the worker will take up to that amount of time to run the work item. After that time, it is fine to kill the worker.

Additional info:
Note, the original BZ that implemented the stopping_timeout was https://bugzilla.redhat.com/show_bug.cgi?id=1395736
https://github.com/ManageIQ/manageiq/pull/15529
Please assess the impact of this issue and update the severity accordingly. Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition. If it's something like a tracker bug where it doesn't matter, please set it to Low/Low.
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/31c07a1d0edb1f88e76a993f732c9399ec68e8ca

commit 31c07a1d0edb1f88e76a993f732c9399ec68e8ca
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Jul 7 17:13:52 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Aug 15 14:31:13 2017 -0400

    Let queue workers process an active message

    https://bugzilla.redhat.com/show_bug.cgi?id=1481800

    In e5f4bd3fe1299070e40d235be963428b4f9a2d14, we added a 10 minute timeout
    that would give workers a little time to complete their work after they
    exceed their memory threshold before we'd kill them. This causes workers
    to be killed prematurely before completing the work item.

    What we really want is for the work item to complete but kill the worker
    if the worker has exceeded memory/time thresholds and the work item hasn't
    completed in a reasonable time. This reasonable time is the msg_timeout
    associated with the queue message.

 app/models/miq_worker.rb       |  2 +-
 spec/models/miq_worker_spec.rb | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)
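
For readers following along, the decision described in the commit message above could be sketched roughly as below. This is only an illustration under assumptions, not the code from the commit: `ActiveMessage`, `kill_stopping_worker?`, and all of its parameters are hypothetical names, and the real change lives in app/models/miq_worker.rb. The idea shown is that once the stopping_timeout grace period has passed, a worker with an active message is left alone until that message's msg_timeout has also expired.

```ruby
# Hypothetical stand-in for a queue message; NOT the real MiqQueue model.
# started_at is a Time, msg_timeout is a number of seconds.
ActiveMessage = Struct.new(:started_at, :msg_timeout)

# Hypothetical helper: should a worker that was asked to stop be killed now?
def kill_stopping_worker?(stop_requested_at:, stopping_timeout:, active_message: nil, now: Time.now)
  # Still inside the stopping_timeout grace period: do not kill yet.
  return false if now - stop_requested_at <= stopping_timeout

  # Grace period is over and no message is being processed: safe to kill.
  return true if active_message.nil?

  # Otherwise let the active message run until its own msg_timeout expires.
  now - active_message.started_at > active_message.msg_timeout
end

now    = Time.at(1_500_000_000)
report = ActiveMessage.new(now - 600, 3600) # running for 10 of its 60 minutes

# stopping_timeout (30s) has passed, but the report is within its msg_timeout.
puts kill_stopping_worker?(stop_requested_at: now - 60, stopping_timeout: 30,
                           active_message: report, now: now) # prints false
```

With the values from the reproduction steps (a 30 second stopping_timeout and a 60 minute msg_timeout), the sketch keeps the worker alive while the report runs and only allows the kill once the message itself has overrun.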
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/8388fdfb66ddc968d02947cd0d981f4537a41357

commit 8388fdfb66ddc968d02947cd0d981f4537a41357
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Jul 7 17:33:12 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Aug 15 14:31:28 2017 -0400

    The stop is pending, it's not actively stopping

    https://bugzilla.redhat.com/show_bug.cgi?id=1481800

    The worker is probably working on a queue message that takes a long time,
    so we let it try to complete this work item and have a follow-up work item
    where we ask the worker to exit cleanly on its own.

    "Stop pending" better describes this graceful worker exit workflow.

    ```
    ** Using session_store: ActionDispatch::Session::MemCacheStore
    Checking EVM status...
     Zone    | Server | Status  | ID            | PID   | SPID  | URL                     | Started On           | Last Heartbeat       | Master? | Active Roles
    ---------+--------+---------+---------------+-------+-------+-------------------------+----------------------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------
     default | EVM    | started | 1000000000001 | 38192 | 38206 | druby://127.0.0.1:50844 | 2017-07-07T21:29:20Z | 2017-07-07T21:32:34Z | true    | automate:database_operations:database_owner:ems_inventory:ems_operations:event:reporting:scheduler:smartstate:user_interface:web_services:websocket

     Worker Type      | Status       | ID            | PID   | SPID  | Server id     | Queue Name / URL    | Started On           | Last Heartbeat       | MB Usage
    ------------------+--------------+---------------+-------+-------+---------------+---------------------+----------------------+----------------------+----------
     MiqGenericWorker | stop pending | 1000000000207 | 38374 | 38380 | 1000000000001 | generic             | 2017-07-07T21:32:19Z | 2017-07-07T21:32:33Z | 245
     MiqUiWorker      | started      | 1000000000206 | 38234 |       | 1000000000001 | http://0.0.0.0:3000 | 2017-07-07T21:29:21Z | 2017-07-07T21:32:34Z | 533
    ```

 lib/tasks/evm_application.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Verified.