Red Hat Bugzilla – Bug 534279
have plugins deployed in database, not filesystem
Last modified: 2013-04-30 19:32:03 EDT
Right now, agent plugins are deployed in the server's rhq-downloads/rhq-plugins directory. The agent goes through the server to download them (i.e. they are streamed through the comm layer; they do not go through a normal HTTP GET). We could instead deploy the agent plugins in the database (i.e. store the actual .jar content in the DB, not on the filesystem).
We could use this to our advantage for two reasons:
1) today, if the server dies or the agent otherwise thinks it has to switch over to another server while the agent is downloading a plugin, the download will fail and the agent won't get the plugin. It would be nice if the switchover didn't affect it and the agent can therefore pick up where it left off - in other words, have it download the second part of the plugin from the second server.
2) today, to deploy a new plugin to the cloud, you have to deploy it to every server in the cloud (since you need it available to all agents to download). This means that two servers will potentially try to merge in metadata changes as both get the new plugin simultaneously. If you do it one by one, then some servers will have the old plugin and some the new - you have to wait until all of their deployers detect it before they can all say they are updated. It would be nice to deploy once and have it go to the cloud in one shot.
What I suggest is an Administration page that lets you upload plugins. That would stream the .jar content to the RHQ_PLUGINS table. When the server needs to parse the metadata, it could write the file to java.io.tmpdir just so it can access the .jar normally. But when an agent asks for the plugin, we could do some tricks to have the agent stream byte ranges from the server - if the server dies or the connection is lost, the agent just asks for the next byte range of the plugin content from the second server. Complete failover for plugin downloads.
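A rough sketch of the agent-side resume logic (all names here are made up for illustration; a real version would issue HTTP Range requests against the RHQ servers rather than reading from in-memory copies):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of a range-based, resumable plugin download.
// A "server" here is just a byte[] copy of the plugin content.
public class RangedDownload {
    static final int CHUNK = 4; // tiny chunk size, for illustration only

    // Fetch bytes [offset, offset+CHUNK) from the given server copy,
    // or return null to simulate the server dying mid-download.
    static byte[] fetchRange(byte[] server, int offset, boolean fail) {
        if (fail) return null;
        int end = Math.min(offset + CHUNK, server.length);
        return Arrays.copyOfRange(server, offset, end);
    }

    // Download the plugin, failing over to the next server while keeping
    // the current byte offset - the agent "picks up where it left off"
    // instead of restarting the whole download.
    static byte[] download(byte[][] servers, int dyingServer) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int offset = 0, serverIdx = 0;
        int total = servers[0].length;
        while (offset < total) {
            // simulate: the dying server drops out after its first chunk
            boolean fail = (serverIdx == dyingServer && offset >= CHUNK);
            byte[] chunk = fetchRange(servers[serverIdx], offset, fail);
            if (chunk == null) { serverIdx++; continue; } // fail over, keep offset
            out.write(chunk, 0, chunk.length);
            offset += chunk.length;
        }
        return out.toByteArray();
    }
}
```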
A poor man's solution that would solve problem #2 above but not #1: the Admin page takes an uploaded plugin and stores it in the DB; each server detects a new plugin getting deployed in the DB and writes out the new .jar to its rhq-downloads/rhq-plugins directory. When a server sees a new plugin in the DB, it will assume whichever server stored it there already updated the metadata, so that no longer has to be done. Doing this, the agent download code remains the same - we still have individual plugin .jar copies in each server's rhq-plugins directory.
I found another reason to do this, albeit not your typical use case.
I had the perftest plugin and put it on a server cloud. I rebuilt a new one and deployed it to another server, but the filename was different (same plugin, just a different file name).
Well, the filename got updated in the database, but the other servers don't have the plugin under that filename, so the agents failed to download it.
The new agent update download servlet is working nicely; I think we could do something similar. Instead of the file content being on the filesystem, it could be in the DB. Or, what we could do is at startup (and in a periodic job thereafter), the server could read the database, slurp the contents out, and reconstitute the plugin files on the local filesystem.
If the server's job notices a plugin changed, it can double-check its list of plugins - if any are gone from the database, it deletes them from the filesystem. It also writes to the filesystem the plugin jars that changed.
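The periodic sync job described above could look roughly like this (a sketch only: the Map stands in for the RHQ_PLUGIN table, and the class/method names are made up):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Arrays;
import java.util.Map;

// Hypothetical sketch of the DB -> filesystem reconciliation job.
// A real job would query the plugin table's blob column instead of a Map.
public class PluginSync {
    static void sync(Map<String, byte[]> dbPlugins, Path pluginDir) throws IOException {
        Files.createDirectories(pluginDir);
        // delete filesystem jars that are gone from the database
        try (DirectoryStream<Path> files = Files.newDirectoryStream(pluginDir, "*.jar")) {
            for (Path jar : files) {
                if (!dbPlugins.containsKey(jar.getFileName().toString())) {
                    Files.delete(jar);
                }
            }
        }
        // write out new or changed plugin jars
        for (Map.Entry<String, byte[]> e : dbPlugins.entrySet()) {
            Path jar = pluginDir.resolve(e.getKey());
            if (!Files.exists(jar) || !Arrays.equals(Files.readAllBytes(jar), e.getValue())) {
                Files.write(jar, e.getValue());
            }
        }
    }
}
```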
The big question here is: do we really want to store megabytes of content in the database? (We will have the same issue in the future with AS cluster stuff.)
Or would it make more sense to have one dedicated RHQ server in the cloud that is the designated server for such blob content, so that the agents connect to this dedicated one? Of course one can argue about what happens when this dedicated one fails - but if this server is down, I can imagine that customers have more important issues than just deploying a new plugin.
I say no for a couple reasons:
1) we built HA for the explicit purpose of NOT requiring a single specific server to be up. Our HA system is a true cloud: servers can come up and down and the agents will be fine. As soon as we require a "special" server to always be running, we break that entire design and we are back to a single point of failure on the RHQ Server again. Not good. If a server is down, the customer should be concerned about that server, yes, but should not be concerned that the entire cloud has a problem.
2) That would require server-to-server communications - something we are avoiding at all costs. Servers do not talk to each other directly.
We are ALREADY storing megabytes of data in the database - in fact, I'll argue we are already designed to store GIGAbytes of data in the DB (the entire content subsystem and server-side content plugin system does this). The plugins are much smaller than that. We only have at most (right now) ~20 plugins with an average size in the kilobyte range each - we'll have more data in the rhq_config tables :)
svn rev 2668
the server will now put the jar file contents in a blob column ("content") on the rhq_plugin table.
the server will read that binary data and store the plugin file on the file system when appropriate (this is so I didn't have to do any major refactoring in the server-side plugin deployment code - it all works the same because we ensure the latest plugins in the database are loaded on the file system at start time).
One caveat - might need to add more code to handle this edge-case that deals with hot-deployment:
1) server A starts up
2) someone deploys a new plugin P to another server in the cloud
3) an agent asks server A for updated plugins
4) server A will see plugin P in the database but server A does not yet have the plugin P on the filesystem
To fix this, either:
a) change the server code that handles agent plugin update requests to always stream the plugin content from the DB, not file system
b) document that users must restart all servers if new plugins are deployed (i.e. hot deployment only works on the server where you copied the plugin to)
c) change the server code to stream plugin content from DB to filesystem if an agent asks for a plugin that is not yet on the file system.
I'm leaning towards c) because it allows filesystem streaming 99% of the time (which I think is faster plus it reduces the load on the database) but it still allows for hot-deployment of plugins across a cloud without the need to restart servers. However, if I do c), it means that if a server goes down during plugin download, an agent will get a failure and will need to retry the plugin download (which, btw, it now does, so this really isn't a problem anymore - the agent will simply retry the download from another server).
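Option c) could be sketched roughly like this (hypothetical names; the Map again stands in for the plugin table's blob column):

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.*;
import java.util.Map;

// Hypothetical sketch of option c): serve the plugin from the filesystem,
// lazily pulling it out of the database first if it is not there yet.
public class LazyServe {
    static byte[] servePlugin(String name, Path pluginDir,
                              Map<String, byte[]> db) throws IOException {
        Path jar = pluginDir.resolve(name);
        if (!Files.exists(jar)) {
            byte[] content = db.get(name); // in reality: stream the blob from the DB
            if (content == null) throw new FileNotFoundException(name);
            Files.write(jar, content);     // cache on the filesystem for next time
        }
        return Files.readAllBytes(jar);    // the 99% case: a plain filesystem read
    }
}
```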
BTW: a) sounds great in theory, but I just noticed that (while debugging using postgres) the JDBC driver's InputStream that I get back when reading the blob column is a java.io.ByteArrayInputStream with the entire blob data stored in that byte array in memory!!! Which totally defeats the purpose of JDBC BLOB streaming! So, doing a) has the very real potential of causing an OutOfMemoryError if a lot of agents simultaneously attempt to download very large plugins (like the JBossAS plugin).
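For reference, the bounded-memory copy pattern that a) would depend on looks like this - though as noted above, it only helps when the driver actually streams the blob rather than buffering it (a generic sketch, not RHQ code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Copy a blob's InputStream to the servlet response in fixed-size chunks,
// so per-download memory use stays bounded regardless of plugin size.
public class BlobCopy {
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[32 * 1024]; // bounded memory per concurrent download
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }
}
```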
this is major priority - it affects RHQ more so when deploying an N server cloud (where N > 1)
turns out the problem I mention above (with a) b) or c) solution) is already jira'ed as RHQ-1360
found out about another edge case that I need to figure out a solution to:
1) server A starts up
2) someone deploys an updated plugin P to another server in the cloud (the plugin already existed, P is just a modified version of that plugin)
3) an agent asks server A for updated plugins
4) server A will see plugin P in the database and server A has P on its filesystem, but it's an old version, and thus the agent will get that old version.
Need a way for the server A to do an MD5 check before serving up a plugin from the filesystem.
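The check could be as simple as comparing digests (a sketch; it assumes the MD5 is stored alongside the plugin row in the database):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the MD5 check: before serving a plugin jar from the
// filesystem, compare its digest against the one stored with the row in the
// database; a mismatch means the filesystem copy is stale.
public class Md5Check {
    static String md5(byte[] content) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(content);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    static boolean isStale(byte[] fileContent, String dbMd5) throws NoSuchAlgorithmException {
        return !md5(fileContent).equals(dbMd5);
    }
}
```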
svn rev 2697 introduces more refactorings to get this to work better.
there is now our own extension to the jboss url deployment scanner - prior to scanning the file system, we scan the database for new/updated plugins. if we find one, we write it to the file system, then let the original file scanner do its thing (at that point, it works just like it did before). this code performs the necessary date/time and md5 checking to determine whether we accept the db plugin or assume the existing file-based plugin should be kept.
I documented some use-cases we should test:
final major refactoring of ProductPluginDeployer should fix all remaining issues related to the db storage of plugin content.
Tested all the 11 test cases in 2 servers/2 agent HA environment as specified above.
RHEL5.3, x86_64, PostgreSQL8.2.4, java 1.6.0_11, JON RHQ SVN rev# 2894
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1090
This bug is related to RHQ-1069
This bug is related to RHQ-1557
This bug incorporates RHQ-1237