Bug 1481800 - Workers processing a miq_queue message that exceed the memory threshold aren't given enough time to exit gracefully
Summary: Workers processing a miq_queue message that exceed the memory threshold aren'...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.7.0
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: GA
: 5.9.0
Assignee: Joe Rafaniello
QA Contact: Tasos Papaioannou
URL:
Whiteboard:
Depends On:
Blocks: 1482670 1482672
TreeView+ depends on / blocked
 
Reported: 2017-08-15 18:29 UTC by Joe Rafaniello
Modified: 2018-05-23 14:56 UTC (History)
6 users (show)

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1482670 1482672 (view as bug list)
Environment:
Last Closed: 2018-03-06 15:31:05 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Joe Rafaniello 2017-08-15 18:29:12 UTC
Description of problem: Workers running a long running task such as generating a complicated report are not given enough time to complete the task if they hit the memory threshold.

Version-Release number of selected component (if applicable):


How reproducible:
https://bugzilla.redhat.com/show_bug.cgi?id=1395736

Steps to Reproduce:
1.  Set memory_threshold for the generic_worker to it's "idle" PSS memory usage + 20 MB 
2.  Set stopping_timeout to a small value, such as 30 seconds
3.  Decrease the generic worker count to 1 so it's the only process doing reports.
4.  Schedule a bunch of long running reports

Actual results:
The worker gets killed after processing the miq_queue work item for "stopping_timeout" seconds.

Typically, this happens on custom reports that include virtual columns that we can't sort/aggregate in SQL and instead do in ruby.


Expected results:
If the miq_queue item has a msg_timeout of 60 minutes and the worker hits the memory threshold, we should honor the msg_timeout since it's expected that the worker will take up to that amount of time to run the work item.  After that time, it is fine to kill the worker.

Additional info:

Comment 2 Joe Rafaniello 2017-08-15 18:30:02 UTC
Note, the original BZ that implemented the stopping_timeout was 
https://bugzilla.redhat.com/show_bug.cgi?id=1395736

Comment 4 Dave Johnson 2017-08-16 20:28:39 UTC
Please assess the impact of this issue and update the severity accordingly.  Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition.

If it's something like a tracker bug where it doesn't matter, please set it to Low/Low.

Comment 5 CFME Bot 2017-08-17 13:46:28 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/31c07a1d0edb1f88e76a993f732c9399ec68e8ca

commit 31c07a1d0edb1f88e76a993f732c9399ec68e8ca
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Jul 7 17:13:52 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Aug 15 14:31:13 2017 -0400

    Let queue workers process an active message
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1481800
    
    In e5f4bd3fe1299070e40d235be963428b4f9a2d14, we added a 10 minute
    timeout that would give workers a little time to complete their work
    after they exceed their memory threshold before we'd kill them.  This
    causes workers to be killed prematurely before completing the work item.
    
    What we really want is for the work item to complete but kill the worker
    if the worker has exceeded memory/time thresholds and the work item hasn't
    completed in a reasonable time.  This reasonable time is the msg_timeout
    associated with the queue message.

 app/models/miq_worker.rb       |  2 +-
 spec/models/miq_worker_spec.rb | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

Comment 6 CFME Bot 2017-08-17 13:46:32 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/8388fdfb66ddc968d02947cd0d981f4537a41357

commit 8388fdfb66ddc968d02947cd0d981f4537a41357
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Fri Jul 7 17:33:12 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Tue Aug 15 14:31:28 2017 -0400

    The stop is pending, it's not actively stopping
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1481800
    
    The worker is probably working on a queue message that takes a long time
    so we let it try to complete this work item and have a follow up work
    item where we ask the worker to exit cleanly on it's own.  "Stop pending"
    better describes this graceful worker exit workflow.
    
    ```
    ** Using session_store: ActionDispatch::Session::MemCacheStore
    Checking EVM status...
     Zone    | Server | Status  |            ID |   PID |  SPID | URL                     | Started On           | Last Heartbeat       | Master? | Active Roles
    ---------+--------+---------+---------------+-------+-------+-------------------------+----------------------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------
     default | EVM    | started | 1000000000001 | 38192 | 38206 | druby://127.0.0.1:50844 | 2017-07-07T21:29:20Z | 2017-07-07T21:32:34Z | true    | automate:database_operations:database_owner:ems_inventory:ems_operations:event:reporting:scheduler:smartstate:user_interface:web_services:websocket
    
     Worker Type      | Status       |            ID |   PID | SPID  |     Server id | Queue Name / URL    | Started On           | Last Heartbeat       | MB Usage
    ------------------+--------------+---------------+-------+-------+---------------+---------------------+----------------------+----------------------+----------
     MiqGenericWorker | stop pending | 1000000000207 | 38374 | 38380 | 1000000000001 | generic             | 2017-07-07T21:32:19Z | 2017-07-07T21:32:33Z |      245
     MiqUiWorker      | started      | 1000000000206 | 38234 |       | 1000000000001 | http://0.0.0.0:3000 | 2017-07-07T21:29:21Z | 2017-07-07T21:32:34Z |      533
    ```

 lib/tasks/evm_application.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 9 Tasos Papaioannou 2018-01-08 18:38:02 UTC
Verified.


Note You need to log in before you can comment on or make changes to this bug.