When cumin is first started, there is a period during which qmf calls do not succeed because they time out. I noticed that a lot of agents are being created during this time:

2566 2010-08-04 10:33:52,378 DEBUG New agent Agent(v2) at bank 1.com.redhat.grid:slot:2f46b6b9-ad17-408f-bf82-57494982443f (QMFv2 Agent)
2566 2010-08-04 10:33:52,561 DEBUG New agent Agent(v2) at bank 1.com.redhat.grid:slot:d0500bea-422c-49b9-aa35-b681c2e5e2c8 (QMFv2 Agent)
2566 2010-08-04 10:33:52,695 DEBUG New agent Agent(v2) at bank 1.com.redhat.grid:slot:075c1df0-25cb-4bbb-a79e-439076a831a5 (QMFv2 Agent)
2566 2010-08-04 10:33:52,847 DEBUG New agent Agent(v2) at bank 1.com.redhat.grid:slot:b2e36081-1d16-43d9-ae7c-4ffe612da5dd (QMFv2 Agent)

For example, the negotiator call to GetRawConfig will time out until all the agents are created. Once things settle down, the call succeeds with the following debug output:

2566 2010-08-04 10:37:41,702 DEBUG New package com.redhat.grid
2566 2010-08-04 10:37:41,702 DEBUG New class com.redhat.grid:negotiator:_data(724d6159-593c-d727-7e01-441355cbb6ef)
2566 2010-08-04 10:37:42,474 DEBUG Method response for request 1280933642 received from Broker connected at: mrg31.lab.bos.redhat.com:5672
2566 2010-08-04 10:37:42,475 DEBUG Response: OK (0) - {u'Value': 'msg, grid, mgmt, rt'}

Happens in cumin version 4185.

To Reproduce:
Start cumin-web --debug
Immediately try a qmf call:
- Grid Tab
- Click on a collector
- Negotiator tab
- Click on a negotiator

After about 60 seconds, the call will time out and the page will show an empty list.
Wait about 5 minutes and watch the cumin-web output until the new agent messages stop. Then reload the web page. The call then succeeds and you see a list of groups and quotas.

This behavior is not limited to the GetRawConfig call. It also happens for the calls to JobSummaries, GetLimits, SubmitJob, etc.
The options here are not great. Cumin could refuse to service web requests until all the agents are synced up, but there is no defined point at which that happens. Some kind of heuristic would be necessary. I think the cumin qmf call code is already doing the right thing. It's waiting for the agent to come in, until it times out. If there is something more we can do about it, we should do it after 1.3.
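To make the heuristic idea concrete, here is a minimal sketch of one possible approach: treat the agents as "settled" once no new agent has shown up for some quiet period. Everything in it is an assumption for illustration. The class name, the agent_added hook, and the 10 second default are hypothetical and are not part of cumin or the qmf console API.

```python
import time


class AgentSettleHeuristic:
    """Guess when the initial flood of new-agent announcements has ended.

    Hypothetical helper for illustration only; not existing cumin code.
    """

    def __init__(self, quiet_period=10.0, clock=time.time):
        self.quiet_period = quiet_period  # seconds without a new agent (assumed value)
        self.clock = clock                # injectable so the logic can be tested
        self.agent_count = 0
        self.last_arrival = None

    def agent_added(self):
        """Record one new agent; would be called wherever "New agent" is logged."""
        self.agent_count += 1
        self.last_arrival = self.clock()

    def is_settled(self):
        """Return True once no new agent has arrived for quiet_period seconds."""
        if self.last_arrival is None:
            return False  # no agent seen yet, so there is nothing to judge by
        return self.clock() - self.last_arrival >= self.quiet_period
```

This only guesses at when things have settled. A slow trickle of agents with gaps longer than the quiet period would fool it, which is the "no defined point" problem again.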
Are the timeouts reflected to the user? For things like JobSummaries etc, do we have data accessing indicators?
I am unable to reproduce this on cumin-0.1.4560-1.el5. Was it fixed already?
It was not fixed; however, it may only happen on slow connections. When I'm connected to the grid0 broker over VPN, it takes a few minutes after I start cumin-web before qmf calls can be made. The error message is either "Agent grid0.lab.bos.redhat.com is unknown" or a timeout error message. As Justin mentioned, there may not be anything we can do about this other than display the appropriate waiting/error messages. That was covered in a separate BZ. Unless there is some magical way to ensure that all agents that can make qmf calls have synced up before we get too busy handling the slot agents, I'd say we could close this BZ.
Forgot to clear the needinfo flag.