Bug 1519538 - Smartstate request taking too long is killed because Worker Monitoring Code incorrectly thinks the busy Smartproxy Worker is not responding
Summary: Smartstate request taking too long is killed because Worker Monitoring Code i...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: SmartState Analysis
Version: 5.8.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: GA
: 5.10.0
Assignee: Jerry Keselman
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks: 1531299 1532854
TreeView+ depends on / blocked
 
Reported: 2017-11-30 21:07 UTC by Jerry Keselman
Modified: 2019-08-06 20:05 UTC (History)
3 users (show)

Fixed In Version: 5.10.0.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1531299 1532854 (view as bug list)
Environment:
Last Closed: 2018-06-21 20:47:36 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
An example of this issue in the evm.log file. (3.82 MB, text/plain)
2017-11-30 21:10 UTC, Jerry Keselman
no flags Details
Another example of the issue. (2.60 MB, text/plain)
2017-11-30 21:11 UTC, Jerry Keselman
no flags Details

Description Jerry Keselman 2017-11-30 21:07:23 UTC
Description of problem:
During a longer running Smartstate request (in this case for Azure Managed Disk Instances) the worker monitoring code thinks that the SmartProxy worker is not responding and therefore kills it.  In fact the Smartproxy is actively working based on DEBUG level logs.  This may be related to the fact that the worker is so busy hearbeating is not occuring but it is not clear.


Version-Release number of selected component (if applicable):


How reproducible:  Occasionally when Smartstate takes at least 80 minutes this occurs.  I am attaching two log snippets exhibiting this behavior.


Steps to Reproduce:
1.Run Smartstate in an slow environment or on a very large Instance (or both).
2.
3.

Actual results:


Expected results:
Smartstate Completes

Additional info:
The error message received when this problem occurs is similar to: 
WARN -- : MIQ(MiqServer#worker_not_responding) Worker [MiqSmartProxyWorker] with ID: [1000000029369], PID: [58033], GUID: [29a57144-ce2d-11e7-b191-000d3ad11b75] being killed because it is not responding.

This issue was found in a log uploaded to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1508154 which demonstrated several different issues and is therefore being handled via this separate BZ.

Comment 2 Jerry Keselman 2017-11-30 21:10:22 UTC
Created attachment 1361227 [details]
An example of this issue in  the evm.log file.

Comment 3 Jerry Keselman 2017-11-30 21:11:44 UTC
Created attachment 1361228 [details]
Another example of the issue.

Comment 4 Jerry Keselman 2017-11-30 21:21:54 UTC
As a note of info the two cases shown in the attachments are both for Linux Instances.

Comment 6 CFME Bot 2018-01-04 21:46:15 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/906ed996fbf690844d9a9382c3019a31fb4482da

commit 906ed996fbf690844d9a9382c3019a31fb4482da
Author:     Jerry Keselman <jerryk>
AuthorDate: Tue Dec 19 10:10:07 2017 -0500
Commit:     Jerry Keselman <jerryk>
CommitDate: Thu Dec 21 15:48:26 2017 -0500

    Add Heartbeat Thread to SmartProxy Worker
    
    In order to fix an issue where long-running Smartstate jobs get killed
    under the mistaken assumption that they are being unresponsive when they
    are actually quite busy, a separate thread is being added to the SmartProxy Worker
    which just heartbeats every 30 seconds.
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1519538

 app/models/miq_smart_proxy_worker/runner.rb | 66 +++++++++++++++++++++++++++++
 config/settings.yml                         |  3 +-
 2 files changed, 68 insertions(+), 1 deletion(-)


Note You need to log in before you can comment on or make changes to this bug.