Description of problem:
During a long-running Smartstate request (in this case for Azure Managed Disk Instances), the worker monitoring code decides that the SmartProxy worker is not responding and kills it. DEBUG-level logs show that the SmartProxy worker is in fact actively working. This may be because the worker is so busy that heartbeating does not occur, but that is not yet clear.

Version-Release number of selected component (if applicable):

How reproducible:
Occasionally, when Smartstate takes at least 80 minutes. I am attaching two log snippets exhibiting this behavior.

Steps to Reproduce:
1. Run Smartstate in a slow environment or on a very large Instance (or both).

Actual results:

Expected results:
Smartstate completes.

Additional info:
The error message received when this problem occurs is similar to:

WARN -- : MIQ(MiqServer#worker_not_responding) Worker [MiqSmartProxyWorker] with ID: [1000000029369], PID: [58033], GUID: [29a57144-ce2d-11e7-b191-000d3ad11b75] being killed because it is not responding.

This issue was found in a log uploaded to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1508154, which demonstrated several different issues; it is therefore being handled via this separate BZ.
Created attachment 1361227 [details] An example of this issue in the evm.log file.
Created attachment 1361228 [details] Another example of the issue.
Note: the two cases shown in the attachments are both for Linux Instances.
https://github.com/ManageIQ/manageiq/pull/16685
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/906ed996fbf690844d9a9382c3019a31fb4482da

commit 906ed996fbf690844d9a9382c3019a31fb4482da
Author:     Jerry Keselman <jerryk>
AuthorDate: Tue Dec 19 10:10:07 2017 -0500
Commit:     Jerry Keselman <jerryk>
CommitDate: Thu Dec 21 15:48:26 2017 -0500

    Add Heartbeat Thread to SmartProxy Worker

    In order to fix an issue where long-running Smartstate jobs get killed
    under the mistaken assumption that they are being unresponsive when they
    are actually quite busy, a separate thread is being added to the SmartProxy
    Worker which just heartbeats every 30 seconds.

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1519538

 app/models/miq_smart_proxy_worker/runner.rb | 66 +++++++++++++++++++++++++++++
 config/settings.yml                         |  3 +-
 2 files changed, 68 insertions(+), 1 deletion(-)
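For readers unfamiliar with the approach, the idea described in the commit message can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the pull request: the class `HeartbeatingWorker` and all of its method names are hypothetical, and only the 30-second interval comes from the commit message. A dedicated thread records a timestamp at a fixed interval, independently of whatever the main thread is doing, so a monitor that compares the last heartbeat against a timeout still sees the worker as alive during a long job.

```ruby
# Hypothetical sketch of a worker with a dedicated heartbeat thread.
# None of these names are taken from ManageIQ; they are for illustration only.
class HeartbeatingWorker
  def initialize(interval: 30)
    @interval = interval            # seconds between heartbeats (30 per the commit message)
    @lock = Mutex.new
    @cond = ConditionVariable.new   # lets stop_heartbeat_thread interrupt the wait early
    @stop = false
    @last_heartbeat = nil
    @beat_count = 0
  end

  # Start a background thread that heartbeats until asked to stop.
  def start_heartbeat_thread
    @thread = Thread.new do
      @lock.synchronize do
        until @stop
          @last_heartbeat = Time.now
          @beat_count += 1
          # Releases the lock while waiting; returns after @interval seconds
          # or as soon as stop_heartbeat_thread signals.
          @cond.wait(@lock, @interval)
        end
      end
    end
  end

  # Ask the heartbeat thread to exit and wait for it to finish.
  def stop_heartbeat_thread
    @lock.synchronize do
      @stop = true
      @cond.signal
    end
    @thread&.join
  end

  def last_heartbeat
    @lock.synchronize { @last_heartbeat }
  end

  def beat_count
    @lock.synchronize { @beat_count }
  end

  def heartbeat_thread_alive?
    !@thread.nil? && @thread.alive?
  end

  # What a monitor might check: has the worker heartbeated within `timeout` seconds?
  def responding?(timeout)
    last = last_heartbeat
    !last.nil? && (Time.now - last) <= timeout
  end
end
```

In this sketch a caller would invoke `start_heartbeat_thread` before beginning a long scan and `stop_heartbeat_thread` once it finishes. One trade-off worth noting: a heartbeat from a separate thread shows that the process is alive, not that the scan itself is making progress, so a genuinely hung job would need to be detected some other way.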