Saw this fly by once only:

18:23:50,485 ERROR [JPADriftServerBean] Failed to store drift changeset [JPADriftChangeSet [id=10014, resource=Resource[id=10001, type=Mac OS X, key=snert, name=snert, parent=<null>, version=MacOSX 10.6.8], version=0]]
java.lang.NullPointerException
    at org.rhq.core.pc.drift.DriftManager.ackChangeSet(DriftManager.java:634)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.rhq.enterprise.communications.command.impl.remotepojo.server.RemotePojoInvocationCommandService.execute(RemotePojoInvocationCommandService.java:184)
    at sun.reflect.GeneratedMethodAccessor79.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
Depending on which build this occurred on, the NPE most likely came from one of two lines of code. Either way, it seems odd. In one case it would mean the DriftManager was not fully initialized, which is very unlikely because remoting is not started until after initialization (so the server could not have made the call before initialization completed). In the other case it would mean the coverage changeset was missing from the data directory. That shouldn't really happen unless the data directory (or at least the relevant changeset file in it) was somehow wiped while the agent was running, or the file was simply never present. I'm going to add some null handling and logging to protect here; I'm not sure how to explain or reproduce this. master commit 4a6ec5f97390cc5b6a1956b0b904162ea7b2b319 This is not really testable. Closing.
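A minimal sketch of the kind of null handling and logging described above. The field and method names here (changeSetsDir, the boolean-returning ackChangeSet) are assumptions for illustration, not the actual DriftManager source:

```java
import java.io.File;

public class DriftManagerSketch {
    // Nulled out on shutdown in this sketch, mirroring the suspected cause of the NPE.
    private volatile File changeSetsDir;

    public DriftManagerSketch(File changeSetsDir) {
        this.changeSetsDir = changeSetsDir;
    }

    public void shutdown() {
        changeSetsDir = null;
    }

    /** Returns true only if the ack was actually processed. */
    public boolean ackChangeSet(int resourceId, String changeSetName) {
        File dir = changeSetsDir; // read once so a concurrent shutdown can't null it mid-method
        if (dir == null) {
            log("Cannot ack change set [" + changeSetName + "] for resource [" + resourceId
                + "]: manager is shut down");
            return false;
        }
        File changeSetFile = new File(dir, changeSetName);
        if (!changeSetFile.exists()) {
            log("Change set file [" + changeSetFile + "] is missing; nothing to ack");
            return false;
        }
        // ... normal ack processing would go here ...
        return true;
    }

    private void log(String msg) {
        System.out.println("WARN  " + msg);
    }
}
```

Dereferencing a local copy of the field and bailing out with a log message turns the raw NPE into a diagnosable warning in both suspected cases (manager shut down, or changeset file absent).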
Re-opening this to include more discussion. After more evaluation, it seems that what may have happened is that the plugin container was shut down (probably as part of an agent shutdown). PC shutdown currently takes place while the comm layer is still up, meaning it's possible for the server to make an agent service call to a manager. DriftManager happens to be the first of several managers to get shut down, giving it the biggest window while the others are subsequently shut down. The observed NPE could certainly happen if ackChangeSet is invoked after shutdown, as several global vars are nulled out. The question is what, if anything, to do about this. Although the NPE is ugly, it probably had the same net effect a more graceful exception would have had. If the comm layer had been shut down, contacting the agent service would have failed in that way; or, if the DriftManager services guarded on state, ackChangeSet would have thrown an IllegalStateException or something like that. In all cases the agent service call would fail. Since the plugin container can, I think, be shut down outside of a complete agent shutdown, it's perhaps not feasible to stop the comm layer on PC shutdown. But, personally, I think we need a mechanism that can block agent service calls in a blanket fashion when the PC is being shut down, and it should engage prior to the shutdown of the individual managers.
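A minimal sketch of what such a blanket mechanism could look like: a flag flipped at the very start of PC shutdown, checked at the agent service entry point before dispatching to any manager. The class and method names here (AgentServiceGate, beginShutdown, checkAvailable) are hypothetical, not existing RHQ APIs:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Blanket guard for incoming agent service calls. PC shutdown would call
 * beginShutdown() BEFORE shutting down any individual manager, so every
 * manager is covered for the whole shutdown window.
 */
public class AgentServiceGate {
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    public void beginShutdown() {
        shuttingDown.set(true);
    }

    /** Called at the dispatch point before invoking a manager's service method. */
    public void checkAvailable(String serviceName) {
        if (shuttingDown.get()) {
            throw new IllegalStateException(serviceName
                + " is unavailable: plugin container is shutting down");
        }
    }
}
```

The caller on the server side still sees a failed agent service call, but it fails fast with a clear, uniform message instead of an NPE whose shape depends on which fields a given manager has already nulled out.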
Remember that a "remote agent service" is nothing more than the PC's managers. For example, the "DiscoveryAgentService" *is* the InventoryManager. So when you say "shut down the comm layer", that could mean one of two things: the actual socket comm stuff (jboss/remoting), or the remote agent service endpoints. But leaving jboss/remoting aside, we technically DO shut down the remote agent services when we shut down the PC, because the PC managers ARE in fact the remote agent service endpoints. So when InventoryManager.shutdown is called (which it is when PluginContainer is shut down), the remote agent service is shut down. I don't think there is much more we can do here. No matter what, as you said, "Although the NPE is ugly it in effect probably did the same thing as a more graceful exception would have done." The exceptions might be different, but SOME exception will have to result.
It seems then that the way to go is for each individual manager to simply protect itself, if it sees the need, from calls executing while the manager is down. For example:

    initialize() {
        ...
        initialized = true;
    }

    shutdown() {
        ...
        initialized = false;
    }

    someService() {
        if (!initialized) {
            throw new IllegalStateException("XXX Manager currently not available");
        }
    }

I'm renaming this BZ and taking it away as a blocker.
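The pattern above can be sketched as a small runnable example. The class and service names are placeholders; a real manager would do this per service method. The flag is volatile here on the assumption that service calls arrive on remoting threads other than the one doing initialize/shutdown:

```java
public class GuardedManager {
    // volatile so remoting threads see the flag flipped by the lifecycle thread
    private volatile boolean initialized;

    public void initialize() {
        // ... manager startup (allocate resources, load state) ...
        initialized = true;
    }

    public void shutdown() {
        // Flip the flag first so new calls are rejected during teardown.
        initialized = false;
        // ... manager teardown (null out / release resources) ...
    }

    public String someService() {
        if (!initialized) {
            throw new IllegalStateException("Manager currently not available");
        }
        return "ok";
    }
}
```

Note the ordering in shutdown(): clearing the flag before nulling anything out closes most of the window in which a late-arriving call could still hit a half-torn-down manager and NPE.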