Bug 621228 - QMF calls from the cumin console timeout
Summary: QMF calls from the cumin console timeout
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: cumin   
Version: 1.3
Hardware: All Linux
Target Milestone: 2.0
Assignee: Trevor McKay
QA Contact: Jan Sarenik
Reported: 2010-08-04 14:55 UTC by Ernie
Modified: 2011-03-03 18:59 UTC (History)

Doc Type: Bug Fix
Last Closed: 2011-03-03 18:59:28 UTC


Description Ernie 2010-08-04 14:55:39 UTC
When cumin is first started, there is a period of time during which QMF calls do not succeed due to timeouts.
I noticed that during this time there are a lot of agents being created:
2566 2010-08-04 10:33:52,378 DEBUG New agent Agent(v2) at bank 1.com.redhat.grid:slot:2f46b6b9-ad17-408f-bf82-57494982443f (QMFv2 Agent)
2566 2010-08-04 10:33:52,561 DEBUG New agent Agent(v2) at bank 1.com.redhat.grid:slot:d0500bea-422c-49b9-aa35-b681c2e5e2c8 (QMFv2 Agent)
2566 2010-08-04 10:33:52,695 DEBUG New agent Agent(v2) at bank 1.com.redhat.grid:slot:075c1df0-25cb-4bbb-a79e-439076a831a5 (QMFv2 Agent)
2566 2010-08-04 10:33:52,847 DEBUG New agent Agent(v2) at bank 1.com.redhat.grid:slot:b2e36081-1d16-43d9-ae7c-4ffe612da5dd (QMFv2 Agent)

For example, the negotiator call to GetRawConfig will time out until all the agents are created.

Once things settle down, the call succeeds with the following debug output:  
2566 2010-08-04 10:37:41,702 DEBUG New package com.redhat.grid
2566 2010-08-04 10:37:41,702 DEBUG New class com.redhat.grid:negotiator:_data(724d6159-593c-d727-7e01-441355cbb6ef)
2566 2010-08-04 10:37:42,474 DEBUG Method response for request 1280933642 received from Broker connected at: mrg31.lab.bos.redhat.com:5672
2566 2010-08-04 10:37:42,475 DEBUG Response: OK (0) - {u'Value': 'msg, grid, mgmt, rt'}

Happens in cumin version 4185

To Reproduce:
Start cumin-web --debug
Immediately try a qmf call: 
  - Grid Tab
  - Click on a collector
  - Negotiator tab
  - Click on a negotiator
After about 60 seconds, the call will time out and the page will show an empty list.

Wait about 5 minutes and watch the cumin-web output until the new agent messages stop. Then reload the web page.
The call then succeeds and you see a list of groups and quotas.
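
To put a number on how long the agent burst lasts, the captured debug output can be scanned for the "New agent" lines. The following is only a sketch: it assumes the log line layout shown in the description above, and the function name agent_burst is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines in the format seen in this report, for example:
# 2566 2010-08-04 10:33:52,378 DEBUG New agent Agent(v2) at bank ...
NEW_AGENT = re.compile(
    r"^\d+ (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) DEBUG New agent "
)


def agent_burst(lines):
    """Return (count, first_seen, last_seen) for 'New agent' log lines."""
    stamps = []
    for line in lines:
        m = NEW_AGENT.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            # The log prints milliseconds after the comma.
            stamps.append(ts.replace(microsecond=int(m.group(2)) * 1000))
    if not stamps:
        return 0, None, None
    return len(stamps), min(stamps), max(stamps)
```

Passing an open file of captured cumin-web output to agent_burst gives the number of agents seen and the time of the last one, which can be compared with the time at which the first QMF call succeeds.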

This behavior is not limited to the GetRawConfig call. It also happens for calls to JobSummaries, GetLimits, SubmitJob, etc.

Comment 1 Justin Ross 2010-09-16 11:20:27 UTC
The options here are not great.  Cumin could refuse to service web requests until all the agents are synced up, but there is no defined point at which that happens.  Some kind of heuristic would be necessary.
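
One possible shape for such a heuristic is a quiet-period check: treat the agents as synced once no new agent has shown up for some number of seconds. This is a sketch only. The class name, the on_new_agent hook, and the 10-second default are all assumptions, not cumin code, and a slow trickle of agents would still fool it.

```python
import time


class AgentSettleDetector:
    """Heuristic: agents are 'settled' after a quiet period with no new agents."""

    def __init__(self, quiet_period=10.0, clock=time.monotonic):
        self.quiet_period = quiet_period  # seconds of silence required (assumed value)
        self.clock = clock                # injectable so the logic can be tested
        self.last_new_agent = clock()     # startup counts as the last activity
        self.count = 0

    def on_new_agent(self, agent_name):
        # Hypothetical hook, to be called wherever "New agent" is logged.
        self.count += 1
        self.last_new_agent = self.clock()

    def settled(self):
        return self.clock() - self.last_new_agent >= self.quiet_period
```

Web requests could then be held or answered with a "still starting" page while settled() is false. Picking quiet_period is the hard part, since there is no defined point at which the sync is complete.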

I think the cumin qmf call code is already doing the right thing.  It's waiting for the agent to come in, until it times out.
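
For reference, the wait-until-timeout pattern described here can be sketched with a condition variable. This is not the actual cumin call code. AgentRegistry and its methods are hypothetical names used only to illustrate blocking on an agent with a deadline.

```python
import threading


class AgentRegistry:
    """Tracks known agents and lets callers block until one appears."""

    def __init__(self):
        self._cond = threading.Condition()
        self._agents = set()

    def add(self, name):
        # Called when a new agent is discovered; wakes any waiting callers.
        with self._cond:
            self._agents.add(name)
            self._cond.notify_all()

    def wait_for(self, name, timeout):
        """Return True if `name` becomes known within `timeout` seconds, else False."""
        with self._cond:
            return self._cond.wait_for(lambda: name in self._agents, timeout)
```

A method call would first do something like registry.wait_for(agent_name, timeout=60) and report a timeout to the caller when it returns False, which matches the roughly 60-second delay seen in the reproduction steps.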

If there is something more we can do about it, we should do it after 1.3.

Comment 2 Matthew Farrellee 2011-02-01 13:01:16 UTC
Are the timeouts reflected to the user? For things like JobSummaries, do we have indicators that data is still being fetched?

Comment 3 Jan Sarenik 2011-03-01 10:48:51 UTC
I am unable to reproduce this on cumin-0.1.4560-1.el5
Was it fixed already?

Comment 4 Ernie 2011-03-01 13:40:59 UTC
It was not fixed; however, it may only happen on slow connections. When I'm connected to the grid0 broker over VPN, it takes a few minutes after I start cumin-web before QMF calls can be made.
The error message is either:
"Agent grid0.lab.bos.redhat.com is unknown"
or a timeout error message.

As Justin mentioned, there may not be anything we can do about this other than display the appropriate waiting/error messages. That was covered in a separate BZ. Unless there is some magical way to ensure all agents that can make qmf calls have synced up before we get too busy handling the slot agents, I'd say we could close this BZ.
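
As a rough illustration of the waiting/error message idea, the two failure cases above could be mapped to friendlier text. This is a sketch under assumptions: only the "Agent ... is unknown" wording comes from this report. The timeout matching and the user-facing strings are made up, and the real handling lives in the separate BZ.

```python
import re

# Wording taken from the error quoted in this comment.
UNKNOWN_AGENT = re.compile(r"^Agent (\S+) is unknown$")


def user_message(error_text):
    """Translate a raw QMF call error into text suitable for the page."""
    text = error_text.strip()
    m = UNKNOWN_AGENT.match(text)
    if m:
        return ("Agent %s has not reported in yet. "
                "Please retry in a few minutes." % m.group(1))
    # Assumption: timeout errors contain "timeout" or "timed out".
    lowered = text.lower()
    if "timeout" in lowered or "timed out" in lowered:
        return ("The call timed out. Agents may still be syncing; "
                "please retry in a few minutes.")
    return text  # Unrecognized errors are passed through unchanged.
```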

Comment 5 Ernie 2011-03-01 13:41:43 UTC
Forgot to clear the needinfo flag.
