Bug 723896

Summary: Documentation: need smarter conductor rhevm heartbeat
Product: [Retired] CloudForms Cloud Engine
Reporter: wes hayutin <whayutin>
Component: Documentation
Assignee: Justin Clift <jclift>
Status: CLOSED CURRENTRELEASE
QA Contact: wes hayutin <whayutin>
Severity: high
Docs Contact:
Priority: unspecified
Version: 0.3.1
CC: akarol, clalance, dajohnso, deltacloud-maint, kwade, lutter, morazi, ssachdev, whayutin
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 722206
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description wes hayutin 2011-07-21 13:38:30 UTC
+++ This bug was initially created as a clone of Bug #722206 +++

Description of problem:
=======================
During testing I noticed that with 6+ VMs in RHEV-M, the API console scrolled almost continuously, with CPU usage around 80%. When I added a few more, deltacloud-core requests began timing out at 60 seconds.

I opened a bug against deltacloud-core here:
https://issues.apache.org/jira/browse/DTACLOUD-58

In the JIRA issue, I commented that I can see a heartbeat checking the instances every 30 seconds. With 6+ instances, processing each heartbeat request to api/instances takes 40+ seconds, which is longer than the interval between heartbeats.

David Lutterkort commented that this heartbeat comes from conductor, not from a deltacloud-core internal mechanism, and that its interval is too short.
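
The timing problem described above can be illustrated with a small calculation. This is only a sketch: the function name is hypothetical, and it assumes every request takes the same fixed time, whereas overlapping requests against a real backend would probably slow each other down further.

```python
def in_flight_at_each_heartbeat(interval_s, duration_s, beats):
    """Count earlier requests still running when each heartbeat fires.

    Assumes (for illustration only) that every request takes exactly
    duration_s seconds, regardless of load.
    """
    counts = []
    for n in range(beats):
        start = n * interval_s
        # Request k started at k * interval_s; it is still running if it
        # finishes after the current heartbeat starts.
        running = sum(1 for k in range(n) if k * interval_s + duration_s > start)
        counts.append(running)
    return counts


# 30-second heartbeat, 40-second request: every heartbeat after the
# first fires while the previous request is still being processed.
print(in_flight_at_each_heartbeat(30, 40, 5))
```

Whenever the request duration exceeds the interval, the backend never gets an idle period between heartbeats.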

--- Additional comment from clalance on 2011-07-14 12:02:29 EDT ---

Yes, we knew this was going to be a problem.  I don't think tuning down the poll interval is the right solution.  I think that we need to:

1)  Upgrade to RHEV-M 3.0, where this should be a lot faster
2)  Do a lighter weight status update, which should also be faster

We'll need to look at doing one (or both) of these in the deltacloud driver to improve this situation.

--- Additional comment from lutter on 2011-07-14 12:55:51 EDT ---

*** Bug 722226 has been marked as a duplicate of this bug. ***

--- Additional comment from lutter on 2011-07-14 12:57:35 EDT ---

I think the easiest fix is to skip update runs if a previous one hasn't finished yet. Slow backends will be slow, no matter what we do.

Markmc gives me the impression that asking for fewer details when we list instances isn't going to speed matters up.
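
The "skip update runs if a previous one hasn't finished yet" idea could look roughly like the sketch below. It is not conductor's code: the class name `SkippingPoller` and the `fetch_status` callback are hypothetical, and a non-blocking lock is just one way to detect that an update is still in progress.

```python
import threading


class SkippingPoller:
    """Runs a status update on each tick unless one is already running."""

    def __init__(self, fetch_status):
        # fetch_status is a hypothetical callable that queries the backend.
        self._fetch_status = fetch_status
        self._busy = threading.Lock()
        self.skipped = 0

    def tick(self):
        """Call on every heartbeat. Returns True if an update actually ran."""
        # Non-blocking acquire: fails immediately if an update is in progress.
        if not self._busy.acquire(blocking=False):
            self.skipped += 1
            return False
        try:
            self._fetch_status()
            return True
        finally:
            self._busy.release()
```

A timer thread would call `tick()` every 30 seconds. A slow backend then sees at most one outstanding status request at a time, although each individual request is no faster.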

--- Additional comment from clalance on 2011-07-14 13:19:36 EDT ---

Actually, that is what condor does today.  If a batch status times out, it will skip it and try to re-ping it later.

That being said, there is likely a bug in there, in that it can take a long time for condor to re-activate the backend.  Also, I'm not sure how we would ever get out of this situation.  If the last status update took > 30 seconds (for instance), why won't the next one?
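
The closing question can be made concrete with a little arithmetic. The sketch below does not model condor. It assumes a simple skip-while-busy rule and a constant update duration, and the function name is hypothetical.

```python
def effective_run_starts(interval_s, duration_s, horizon_s):
    """Return the start times of updates that actually run.

    Heartbeats fire every interval_s seconds; a heartbeat is skipped if
    the previous update (taking a fixed duration_s) has not finished.
    """
    starts = []
    busy_until = 0
    t = 0
    while t < horizon_s:
        if t >= busy_until:
            starts.append(t)
            busy_until = t + duration_s
        t += interval_s
    return starts


# With a 30-second heartbeat and 40-second updates, every other
# heartbeat is skipped, so updates run once per 60 seconds.
print(effective_run_starts(30, 40, 180))
```

Under these assumptions a slow update never gets faster. The polling rate simply drops to what the backend can sustain.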

--- Additional comment from whayutin on 2011-07-21 09:21:19 EDT ---

doc it

Comment 1 wes hayutin 2011-08-01 19:42:36 UTC
BZ 723896 - Need smarter conductor RHEV-M heartbeat
The current implementation can overload the RHEV-M v2.2 server when more than five instances exist, due to a limitation in the RHEV-M API.
Users should limit the number of instances running within RHEV-M.

Comment 2 wes hayutin 2011-08-01 19:48:32 UTC
removing from tracker

Comment 3 wes hayutin 2011-08-01 19:55:23 UTC
release pending...

Comment 4 wes hayutin 2011-08-01 19:57:10 UTC
release pending...

Comment 6 wes hayutin 2011-12-08 13:53:04 UTC
closing out old bugs

Comment 7 wes hayutin 2011-12-08 14:06:57 UTC
perm close