Bug 1519538

Summary: Smartstate request taking too long is killed because Worker Monitoring Code incorrectly thinks the busy Smartproxy Worker is not responding
Product: Red Hat CloudForms Management Engine Reporter: Jerry Keselman <jkeselma>
Component: SmartState AnalysisAssignee: Jerry Keselman <jkeselma>
Status: CLOSED CURRENTRELEASE QA Contact: Dave Johnson <dajohnso>
Severity: high Docs Contact:
Priority: high    
Version: 5.8.0CC: jhardy, obarenbo, simaishi
Target Milestone: GAKeywords: TestOnly
Target Release: 5.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 5.10.0.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1531299 1532854 (view as bug list) Environment:
Last Closed: 2018-06-21 20:47:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1531299, 1532854    
Attachments:
Description Flags
An example of this issue in the evm.log file.
none
Another example of the issue. none

Description Jerry Keselman 2017-11-30 21:07:23 UTC
Description of problem:
During a longer running Smartstate request (in this case for Azure Managed Disk Instances) the worker monitoring code thinks that the SmartProxy worker is not responding and therefore kills it.  In fact the Smartproxy is actively working based on DEBUG level logs.  This may be related to the fact that the worker is so busy hearbeating is not occuring but it is not clear.


Version-Release number of selected component (if applicable):


How reproducible:  Occasionally when Smartstate takes at least 80 minutes this occurs.  I am attaching two log snippets exhibiting this behavior.


Steps to Reproduce:
1.Run Smartstate in an slow environment or on a very large Instance (or both).
2.
3.

Actual results:


Expected results:
Smartstate Completes

Additional info:
The error message received when this problem occurs is similar to: 
WARN -- : MIQ(MiqServer#worker_not_responding) Worker [MiqSmartProxyWorker] with ID: [1000000029369], PID: [58033], GUID: [29a57144-ce2d-11e7-b191-000d3ad11b75] being killed because it is not responding.

This issue was found in a log uploaded to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1508154 which demonstrated several different issues and is therefore being handled via this separate BZ.

Comment 2 Jerry Keselman 2017-11-30 21:10:22 UTC
Created attachment 1361227 [details]
An example of this issue in  the evm.log file.

Comment 3 Jerry Keselman 2017-11-30 21:11:44 UTC
Created attachment 1361228 [details]
Another example of the issue.

Comment 4 Jerry Keselman 2017-11-30 21:21:54 UTC
As a note of info the two cases shown in the attachments are both for Linux Instances.

Comment 6 CFME Bot 2018-01-04 21:46:15 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/906ed996fbf690844d9a9382c3019a31fb4482da

commit 906ed996fbf690844d9a9382c3019a31fb4482da
Author:     Jerry Keselman <jerryk>
AuthorDate: Tue Dec 19 10:10:07 2017 -0500
Commit:     Jerry Keselman <jerryk>
CommitDate: Thu Dec 21 15:48:26 2017 -0500

    Add Heartbeat Thread to SmartProxy Worker
    
    In order to fix an issue where long-running Smartstate jobs get killed
    under the mistaken assumption that they are being unresponsive when they
    are actually quite busy, a separate thread is being added to the SmartProxy Worker
    which just heartbeats every 30 seconds.
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1519538

 app/models/miq_smart_proxy_worker/runner.rb | 66 +++++++++++++++++++++++++++++
 config/settings.yml                         |  3 +-
 2 files changed, 68 insertions(+), 1 deletion(-)