Right now, when a remote endpoint is down, the comm layer will persist guaranteed messages to the filesystem - the PersistentFifo file. For agents, this is fine. For the server, maybe not so fine, especially for large environments of thousands of agents. I would like to have the ClientCommandSender's PersistentFifo object be backed by a database table, not a filesystem file. In this case, when the server sees a downed agent, it won't try to write to a file, but will instead persist the messages in a database. We need to be careful about clustered servers - what if two servers try to communicate with the downed agent? In this case, it's probably better that we use a DB so we can transactionally access the persisted fifo data. Perhaps we have PersistentFifo be given a DataSource from which it can a) create the table if it doesn't exist yet and b) read/write the messages to the table. Lots of things to think about, but we should consider a different solution than what we have today, which is creating one fifo .dat file per agent.
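A minimal sketch of the idea above. RHQ's comm layer is Java, but for brevity this uses Python's stdlib sqlite3 to stand in for the server's DataSource; the class name PersistentDbFifo and the command_spool table are hypothetical, not RHQ code. The point is the transactional take(): two clustered servers draining the same agent's spool cannot both deliver the same message, because the select-and-delete happens inside one transaction.

```python
import sqlite3

class PersistentDbFifo:
    """Hypothetical DB-backed replacement for the per-agent .dat spool file."""

    def __init__(self, conn, agent_id):
        self.conn = conn
        self.agent_id = agent_id
        # a) create the table if it doesn't exist yet
        conn.execute("""CREATE TABLE IF NOT EXISTS command_spool (
                            id       INTEGER PRIMARY KEY AUTOINCREMENT,
                            agent_id TEXT NOT NULL,
                            message  BLOB NOT NULL)""")
        conn.commit()

    def put(self, message):
        # b) persist a guaranteed message instead of spooling it to a file
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO command_spool (agent_id, message) VALUES (?, ?)",
                (self.agent_id, message))

    def take(self):
        # Dequeue the oldest spooled message for this agent transactionally,
        # so concurrent servers cannot both claim the same row.
        with self.conn:
            row = self.conn.execute(
                "SELECT id, message FROM command_spool "
                "WHERE agent_id = ? ORDER BY id LIMIT 1",
                (self.agent_id,)).fetchone()
            if row is None:
                return None
            self.conn.execute(
                "DELETE FROM command_spool WHERE id = ?", (row[0],))
            return row[1]
```

Usage would look like: `fifo = PersistentDbFifo(sqlite3.connect(":memory:"), "agent-1")`, then `fifo.put(b"synchronizeInventory")` while the agent is down, and draining with `fifo.take()` once it reconnects. A real Java version would use a DataSource and row-locking (e.g. SELECT ... FOR UPDATE) rather than sqlite3's connection-level transactions.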
As far as I can see, the only server -> agent APIs that are declared as guaranteed are in DiscoveryAgentService: void synchronizeInventory(int resourceId, EnumSet<SynchronizationType> synchronizationTypes); and void removeResource(int resourceId);. I'm not fully sure when synchronizeInventory would be called, but that's the only one I'd worry could be called with enough regularity to cause the persistent FIFO to kick in.
This topic has been re-visited with the introduction of multi-server HA support:

Design Decision

The decision is to remove the use of server->agent reliable messaging and to minimally add the necessary comments to indicate in the code that the @Asynchronous(guaranteedDelivery="true") annotation should not be used for AgentService services (i.e. services defined for server->agent communication, org.rhq.core.clientapi.agent.*.*AgentService.java). This annotation will be removed in:

* DiscoveryAgentService.synchronizeInventory() : notification of newly committed resources in the AD portlet
* DiscoveryAgentService.removeResource() : notification of uninventoried resources

In both of these cases, the new, more robust synchronization algorithms for 1.1 will ensure proper synch on agent startup regardless of the delivery of these messages.

A third scenario discussed was notification of updated (measurement) schedules. It turns out that reliable messaging was not in place for these updates and the agent must be up for the server-side update to succeed, so it was a non-issue. If this behavior needed to change, it could be handled by adding a getLatestSchedulesForResourceId for modified resources (see InventoryManager.synchInventory) or, if that is too coarse, a new update time specific to schedule updates. In that case, or in general, if we need to re-introduce reliable server->agent messaging we can revisit the options listed above, particularly RHQ-292.

r1272 enforces the new policy of no reliable messaging on calls from server->agent. Removed the guaranteedDelivery option where it was used and added the ability for a ClientRemotePojoFactory to disable the option for any proxies generated by the factory. If the option is found at runtime, it is forced to false and an error is written to the log (to notify devs that they've made a mistake or need to revisit the solution).
Moving to 1.2 since we should at least evaluate the options and pick one (assuming we ever want guaranteed server->agent comm), but for now the choices made for 1.1 make that discussion moot.
even if we have no server->agent guaranteed messages (not sure if this is true today), i think some code has to be changed to ensure the server never writes this command spool .dat file (I believe just a change to the comm configuration file on the server will turn this off).
as of today, there is no guaranteed delivery for any server->agent messages (to confirm, do a search for all usages of org.rhq.core.communications.command.annotation.Asynchronous - there should be none in any *AgentService interfaces).
server will no longer create the spool files. to test this, simply start a server that has one or more agents connected to it. You should no longer see any command_spool.dat files in the jbossas/server/default/data directory.
where 32163 is the pid of the JON Server JVM:

-bash-3.00$ ls -la /proc/32163/fd > /tmp/jon-files.txt
-bash-3.00$ cat /tmp/jon-files.txt | wc -l
1027
-bash-3.00$ cat /tmp/jon-files.txt | grep command-spool | wc -l
806

So roughly 80% of the server's open files were coming from these command spool files.
QA verified: spool files are no longer created.
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-292 This bug relates to RHQ-1408