Bug 722206 - need smarter conductor rhevm heartbeat
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: CloudForms Cloud Engine
Classification: Retired
Component: aeolus-conductor
Version: 1.0.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Assignee: Angus Thomas
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-07-14 15:49 UTC by Dave Johnson
Modified: 2012-05-15 21:44 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned As: 723896
Environment:
Last Closed: 2012-05-15 21:44:56 UTC


Attachments: none


Links
Red Hat Product Errata RHEA-2012:0583 (normal, SHIPPED_LIVE): new packages: aeolus-conductor. Last updated: 2012-05-15 22:31:59 UTC

Description Dave Johnson 2011-07-14 15:49:28 UTC
Description of problem:
=======================
During testing I noticed that with 6+ VMs in RHEV-M, the API console scrolled practically continuously with CPU usage around 80%. After I added a few more VMs, deltacloud-core timeouts began occurring at 60 seconds.

I opened a bug against deltacloud-core here:
https://issues.apache.org/jira/browse/DTACLOUD-58

In the JIRA issue I commented that I can see a heartbeat checking the instances every 30 seconds. With 6+ instances, servicing the heartbeat's request to api/instances takes 40+ seconds, which is longer than the delay between heartbeats.

David Lutterkort commented that this heartbeat comes from conductor rather than from a deltacloud-core internal mechanism, and that its interval is too short.
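As a rough illustration of why this never catches up, here is a back-of-envelope model of the numbers above (a hypothetical helper, not conductor code), assuming heartbeats fire every 30 seconds and each poll is serviced serially in 40 seconds:

```ruby
# Back-of-envelope model of the report's numbers (not conductor code).
# A heartbeat starts a poll every INTERVAL seconds; each poll takes
# SERVICE seconds, and polls are assumed to be serviced one at a time.
# Because SERVICE > INTERVAL, unfinished polls accumulate forever.
INTERVAL = 30.0  # seconds between heartbeats
SERVICE  = 40.0  # seconds to answer one api/instances poll (6+ instances)

def backlog_after(ticks)
  elapsed   = ticks * INTERVAL
  started   = ticks                      # one poll launched per tick
  completed = (elapsed / SERVICE).floor  # polls fully serviced so far
  started - completed                    # polls still queued or running
end
```

After 8 heartbeats (4 minutes) two polls are already stacked up, and the backlog grows by one roughly every two minutes, which is why the timeouts never clear on their own.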

Comment 1 Chris Lalancette 2011-07-14 16:02:29 UTC
Yes, we knew this was going to be a problem.  I don't think tuning down the poll interval is the right solution.  I think that we need to:

1)  Upgrade to RHEV-M 3.0, where this should be a lot faster
2)  Do a lighter weight status update, which should also be faster

We'll need to look at doing one (or both) of these in the deltacloud driver to improve this situation.

Comment 2 David Lutterkort 2011-07-14 16:55:51 UTC
*** Bug 722226 has been marked as a duplicate of this bug. ***

Comment 3 David Lutterkort 2011-07-14 16:57:35 UTC
I think the easiest fix is to skip update runs if a previous one hasn't finished yet. Slow backends will be slow, no matter what we do.

Markmc gives me the impression that requesting fewer details when we list instances isn't going to speed things up.
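The fix proposed here, skipping a heartbeat tick while the previous poll is still running, can be sketched as a non-reentrant timer tick. This is a hypothetical illustration (the `Heartbeat`/`tick` names are invented, not conductor's actual API), assuming the heartbeat fires on a fixed timer and the poll block may outlast the interval:

```ruby
# Sketch of "skip update runs if a previous one hasn't finished":
# each timer tick tries to take a lock; if the previous poll still
# holds it, the tick is counted as skipped instead of piling up.
class Heartbeat
  def initialize(interval)
    @interval = interval  # seconds between ticks (driven by an external timer)
    @running  = Mutex.new
    @skipped  = 0
  end

  attr_reader :skipped

  # Called every @interval seconds; the block performs the
  # (potentially slow) api/instances poll against the backend.
  # Returns true if the poll ran, false if it was skipped.
  def tick(&poll)
    if @running.try_lock
      begin
        poll.call
      ensure
        @running.unlock
      end
      true
    else
      @skipped += 1  # previous poll still in flight; don't stack another
      false
    end
  end
end
```

With this shape a slow backend degrades to a longer effective poll period instead of an unbounded queue of overlapping requests.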

Comment 4 Chris Lalancette 2011-07-14 17:19:36 UTC
Actually, that is what condor does today.  If a batch status times out, it will skip it and try to re-ping it later.

That being said, there is likely a bug in there, in that it can take a long time for condor to re-activate the backend.  Also, I'm not sure how we would ever get out of this situation.  If the last status update took > 30 seconds (for instance), why won't the next one?

Comment 5 wes hayutin 2011-07-21 13:21:19 UTC
doc it

Comment 6 wes hayutin 2011-09-28 16:39:21 UTC
making sure all the bugs are at the right version for future queries

Comment 8 wes hayutin 2012-01-12 16:51:31 UTC
adding to sprint tracker

Comment 9 Angus Thomas 2012-01-12 17:30:01 UTC
Condor, whose operation this bug relates to, is no longer part of CloudForms.

This bug should either be refreshed with a description of the issue which pertains to the current software stack, or closed.

Comment 10 wes hayutin 2012-01-16 19:29:15 UTC
Dave, double-check how this now runs without condor.

Comment 12 Dave Johnson 2012-02-01 19:10:35 UTC
This all seems to be good now with condor no longer part of the mix.

Marking this as verified with:

aeolus-all-0.8.0-16.el6.noarch
aeolus-conductor-0.8.0-16.el6.noarch
aeolus-conductor-daemons-0.8.0-16.el6.noarch
aeolus-conductor-doc-0.8.0-16.el6.noarch

Comment 13 errata-xmlrpc 2012-05-15 21:44:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0583.html

