Right now, when a remote endpoint is down, the comm layer will persist guaranteed messages to the filesystem - the PersistentFifo file. For agents, this is fine. For the server, maybe not so fine, especially for large environments of thousands of agents. I would like to have the ClientCommandSender's PersistentFifo object be backed by a database table, not a filesystem file. In this case, when the server sees a downed agent, it won't try to write to a file, but will instead persist the messages in a database. We need to be careful about clustered servers - what if two servers try to communicate with the downed agent? In this case, it's probably better that we use a DB so we can transactionally access the persisted fifo data. Perhaps we have PersistentFifo be given a DataSource from which it can a) create the table if it doesn't exist yet and b) read/write the messages to the table. Lots of things to think about, but we should consider a different solution than what we have today, which is creating one fifo .dat file per agent.
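A minimal sketch of the idea above. RHQ's comm layer is Java, but for brevity this uses Python's stdlib sqlite3 to stand in for the server's DataSource; the class name PersistentDbFifo and the command_spool table are hypothetical, not RHQ code. The point is the transactional take(): two clustered servers draining the same agent's spool cannot both deliver the same message, because the select-and-delete happens inside one transaction.

```python
import sqlite3

class PersistentDbFifo:
    """Hypothetical DB-backed replacement for the per-agent .dat spool file."""

    def __init__(self, conn, agent_id):
        self.conn = conn
        self.agent_id = agent_id
        # a) create the table if it doesn't exist yet
        conn.execute("""CREATE TABLE IF NOT EXISTS command_spool (
                            id       INTEGER PRIMARY KEY AUTOINCREMENT,
                            agent_id TEXT NOT NULL,
                            message  BLOB NOT NULL)""")
        conn.commit()

    def put(self, message):
        # b) persist a guaranteed message instead of spooling it to a file
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO command_spool (agent_id, message) VALUES (?, ?)",
                (self.agent_id, message))

    def take(self):
        # Dequeue the oldest spooled message for this agent transactionally,
        # so concurrent servers cannot both claim the same row.
        with self.conn:
            row = self.conn.execute(
                "SELECT id, message FROM command_spool "
                "WHERE agent_id = ? ORDER BY id LIMIT 1",
                (self.agent_id,)).fetchone()
            if row is None:
                return None
            self.conn.execute(
                "DELETE FROM command_spool WHERE id = ?", (row[0],))
            return row[1]
```

Usage would look like: `fifo = PersistentDbFifo(sqlite3.connect(":memory:"), "agent-1")`, then `fifo.put(b"synchronizeInventory")` while the agent is down, and draining with `fifo.take()` once it reconnects. A real Java version would use a DataSource and row-locking (e.g. SELECT ... FOR UPDATE) rather than sqlite3's connection-level transactions.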
As far as I can see, the only server -> agent APIs that are declared as guaranteed are in DiscoveryAgentService: void synchronizeInventory(int resourceId, EnumSet<SynchronizationType> synchronizationTypes); and void removeResource(int resourceId);. I'm not fully sure when synchronizeInventory would be called, but that's the only one I'd worry could be called with enough regularity to cause the persistent FIFO to kick in.
This topic has been re-visited with the introduction of multi-server HA support:

Design Decision

The decision is to remove the use of server->agent reliable messaging and to minimally add the necessary comments to indicate in the code that the @Asynchronous(guaranteedDelivery="true") annotation should not be used for AgentService services (i.e. services defined for server->agent communication, org.rhq.core.clientapi.agent.*.*AgentService.java). This annotation will be removed in:

* DiscoveryAgentService.synchronizeInventory() : notification of newly committed resources in the AD portlet
* DiscoveryAgentService.removeResource() : notification of uninventoried resources

In both of these cases, the new, more robust synchronization algorithms for 1.1 will ensure proper synch on agent startup regardless of the delivery of these messages.

A third scenario discussed was notification of updated (measurement) schedules. It turns out that reliable messaging was not in place for these updates and the agent must be up for the server-side update to succeed, so it was a non-issue. If this behavior needed to change, it could be handled by adding a getLatestSchedulesForResourceId for modified resources (see InventoryManager.synchInventory) or, if that is too coarse, a new update time specific to schedule updates. In that case, or in general, if we need to re-introduce reliable server->agent messaging we can revisit the options listed above, particularly RHQ-292.

r1272 enforces the new policy of no reliable messaging on calls from server->agent. Removed the guaranteedDelivery option where it was used and added the ability for a ClientRemotePojoFactory to disable the option for any proxies generated by the factory. If the option is found at runtime, it is forced to false and an error is written to the log (to notify devs that they've made a mistake or need to revisit the solution).
Moving to 1.2 since we should at least evaluate the options and pick one (assuming we ever want guaranteed server->agent comm), but for now the choices made for 1.1 make that discussion moot.
even if we have no server->agent guaranteed messages (not sure if this is true today), i think some code has to be changed to ensure the server never writes this command spool .dat file (I believe just a change to the comm configuration file on the server will turn this off).
as of today, there is no guaranteed delivery for any server->agent messages (to confirm, do a search for all usages of org.rhq.core.communications.command.annotation.Asynchronous - there should be none in any *AgentService interfaces).
server will no longer create the spool files. to test this, simply start a server that has one or more agents connected to it. You should no longer see any command_spool.dat files in the jbossas/server/default/data directory.
where 32163 is the pid of the JON Server JVM:

-bash-3.00$ ls -la /proc/32163/fd > /tmp/jon-files.txt
-bash-3.00$ cat /tmp/jon-files.txt | wc -l
1027
-bash-3.00$ cat /tmp/jon-files.txt | grep command-spool | wc -l
806

So roughly 80% of the server's open files were coming from these command spool files.
QA verified: spool files are no longer created.
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-292 This bug relates to RHQ-1408