Red Hat Bugzilla – Bug 536581
avoid pushing schedules to all agents at startup
Last modified: 2013-04-30 19:32:59 EDT
At startup, I noticed that ResourceMetadataManager.updateMeasurementDefinitions ends up calling MeasurementScheduleManager.createSchedulesAndSendToAgents, which attempts to ping every agent in the system and, if a ping succeeds, tries to push schedules to that agent.
We should avoid pushing to all agents at startup - this causes the server startup to take a long time. Need to come up with a way for all agents to update their schedules on their own time.
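One possible shape for "agents update on their own time" is a version check that the agent initiates. The sketch below is only an illustration under that assumption: ScheduleRegistry, AgentScheduleSync and their methods are hypothetical names, not existing RHQ classes. Plugin deployment only bumps a counter on the server; each agent compares the counter with the last value it saw and pulls when it differs.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical server-side holder for the "definitions changed" marker.
class ScheduleRegistry {
    private final AtomicLong version = new AtomicLong();

    // Called from plugin deployment: metadata bookkeeping only, no agent contact.
    long definitionsUpdated() {
        return version.incrementAndGet();
    }

    long currentVersion() {
        return version.get();
    }
}

// Hypothetical agent-side helper that decides when to pull schedules.
class AgentScheduleSync {
    private long lastSeenVersion = -1; // -1 forces a pull on first contact
    private int pullCount;

    // Returns true if the agent had to pull fresh schedules.
    boolean syncIfStale(ScheduleRegistry server) {
        long serverVersion = server.currentVersion();
        if (serverVersion == lastSeenVersion) {
            return false; // already up to date, nothing to do
        }
        // A real agent would fetch its schedules from the server here.
        lastSeenVersion = serverVersion;
        pullCount++;
        return true;
    }

    int pullCount() {
        return pullCount;
    }
}
```

With something like this, server startup cost no longer depends on how many agents exist or whether they are reachable; an agent that is down simply catches up the next time it calls syncIfStale.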
This is really critical: it looks like this happens inside the transaction of registerPlugin, which has a transaction timeout of 10 minutes.
These definitely should not be pushed down like this. It looks like someone has removed the API in the DiscoveryAgentService that allowed the server to ask the agent to do an out-of-band update. We need to put this back and use it.
The startup ordering also has to be fixed along with this: plugin deployment must happen BEFORE agent comm is started. Otherwise, agents waiting at the gate to register will get bad, obsolete plugin information when they want to update plugins:
(3:29:39 PM) mazz: the agent clients need to be started AFTER the comm layer is up - because an agent client might send a message that triggers the agent to immediately send a msg to the server
(3:30:04 PM) mazz: josep1: look at rev 1010 of StartupServlet
(3:30:21 PM) mazz: you moved the product plugin start to AFTER the comm layer starting
(3:30:42 PM) josep1: my svn comment "first, PluginDeployer was (for some select plugins) executing before the AgentClients were ready, which wasn't ready because the comm services weren't loaded yet;
(3:30:42 PM) josep1: switch the order that the services are loaded in StartupServlet; "
(3:31:02 PM) mazz: I don't get that
(3:31:13 PM) josep1: neither do i, but i'm sure i had a good reason for it
(3:31:26 PM) josep1: did the plugin deployer ever do any comm?
(3:31:37 PM) josep1: talk to agentclient for some reason
(3:31:43 PM) mazz: so you want the agent clients to start after the plugins are deployed or after they are?
(3:31:50 PM) mazz: no - they can't
(3:31:54 PM) mazz: its just metadata
(3:32:03 PM) mazz: there is no agent stuff happening in there
(3:32:16 PM) mazz: that's the part I don't get
(3:32:43 PM) mazz: plugin deployment should occur before agent clients start up
(3:32:50 PM) mazz: but definitely should happen before the comm layer starts up
(3:33:04 PM) josep1: http://jira.rhq-project.org/browse/RHQ-592
(3:33:10 PM) josep1: sendSchedulesToAgents
(3:33:28 PM) josep1: updating of measurement definitions
(3:33:36 PM) mazz: whoa... you mean plugin deployment sends agent messages?
(3:33:43 PM) mazz: that should not be, IMHO
(3:33:47 PM) josep1: i guess at the time i wanted the agent clients to be ready so the schedule updates would succeed
(3:33:54 PM) josep1: hey, i didn't write that code ; )
(3:34:19 PM) josep1: and mazz, we discussed this a few weeks back, that you didn't like how that was done
(3:34:25 PM) josep1: i think there is another open jira, lemme look
(3:34:43 PM) josep1: http://jira.rhq-project.org/browse/RHQ-916
(3:34:44 PM) mazz: this is bad. because agents that are sitting waiting to register, will now immediately get in prior to the plugin deployments and will thus probably get obsolete plugin information
StartupServlet needs to do this:
// PUT THIS HERE - NEEDS TO HAPPEN BEFORE comm AND BEFORE agent clients START
// PLUGIN DEPLOYMENT MUST NOT TALK TO AGENTS - SHOULD JUST BE METADATA PROCESSING
startServerPluginContainer(); // before comm in case an agent wants to talk to it
// THIS IS BAD - MOVE THIS BEFORE COMM
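To make the intended ordering explicit, here is a minimal sketch, not the actual StartupServlet code. The class StartupSequence and its method names are made up for illustration; the only thing taken from the discussion above is the order: plugin deployment first, then the comm layer, then the agent clients. Each step refuses to run if its prerequisite has not completed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names are illustrative, not the real StartupServlet API.
class StartupSequence {
    private final List<String> completed = new ArrayList<>();
    private boolean pluginsDeployed;
    private boolean commStarted;

    // Metadata processing only; must finish before agents can reach the server.
    void deployPlugins() {
        if (commStarted) {
            throw new IllegalStateException("plugins must be deployed before comm starts");
        }
        pluginsDeployed = true;
        completed.add("plugins");
    }

    // Opens the server to agent traffic, so plugin metadata must already be current.
    void startCommLayer() {
        if (!pluginsDeployed) {
            throw new IllegalStateException("comm started before plugin deployment");
        }
        commStarted = true;
        completed.add("comm");
    }

    // Agent clients may trigger agent-to-server messages, so comm must be up first.
    void startAgentClients() {
        if (!commStarted) {
            throw new IllegalStateException("agent clients started before comm");
        }
        completed.add("agentClients");
    }

    // Runs the three steps in the order argued for in this bug.
    List<String> run() {
        deployPlugins();
        startCommLayer();
        startAgentClients();
        return completed;
    }
}
```

Calling new StartupSequence().run() walks the steps in order, while calling startCommLayer() on a fresh instance fails fast instead of silently letting agents in ahead of plugin deployment.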
RHQ-1326 will remove all agent comm from plugin deployment code. This issue will ensure we put the ordering back the way it was in StartupServlet.
RHQ-1370 has the job of refactoring the schedule update so the agents get their schedules synchronized properly.
The simplest way to test this is to get a set of servers/agents up and running (all agents registered and with resources imported).
Then shut down all the agents and all servers.
Now, restart the server. You should see the server start up with no lag time, and the server should not attempt to send any data at all to the agents. If the server starts up quickly (as it does when the agents are running) and the server log shows no exceptions about failures to talk to agents, then this issue can be considered fixed (the fix for this issue stops the server from talking to agents during its startup).
Tested as specified above. The server does not talk to agents during its startup.
RHEL 5.3, x86_64, PostgreSQL 8.2.4, Java 1.6.0_11, JON RHQ SVN rev# 2894
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-916
This bug is duplicated by RHQ-1186
This bug relates to RHQ-592
This bug relates to RHQ-1326
This bug relates to RHQ-1370