When the plugin container is in the process of being shut down, it is common to see multiple NPEs and other exceptions from various threads in the PC that have no idea the PC is shutting down. Instead, these threads should abort whatever they are doing as gracefully as possible, but in order to do that they need a way to check what state services are currently in. Our new Arquillian itest framework restarts the plugin container two or three times before *each test class*. This results in a ton of extra noise in the test log from the stack traces of all of the "shutdown exceptions" and makes it hard to pick out any real errors in the log.

To implement this, we'll need to add a ContainerServiceState enum with four values: INITIALIZING, INITIALIZED, SHUTTING_DOWN, SHUT_DOWN, plus one new method in the ContainerService interface:

ContainerServiceState getState();

We then need to implement state management in each of our container services, including the plugin container itself. A sketch of the proposed additions is shown below.
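A minimal sketch of what the proposed enum and interface addition might look like. The enum values and the getState() method come from the proposal above; the other ContainerService methods shown here are assumptions for illustration only and may not match the actual interface.

    // Proposed state enum (values as listed above).
    public enum ContainerServiceState {
        INITIALIZING,
        INITIALIZED,
        SHUTTING_DOWN,
        SHUT_DOWN
    }

    public interface ContainerService {
        // Assumed existing lifecycle methods, shown only for context.
        void initialize();
        void shutdown();

        // Proposed new method: lets other PC threads check whether a service
        // is still usable before touching it, so they can abort gracefully
        // while the PC is shutting down instead of throwing NPEs.
        ContainerServiceState getState();
    }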
Created attachment 574577: a patch that implements this enhancement
The crux of this issue is that, in their shutdown() methods, the various PC subsystems call shutdown() or shutdownNow() on any ExecutorServices they have, but they don't wait for all threads from those ExecutorServices to terminate before returning. So our shutdown is not graceful/clean. When PluginContainer.shutdown() returns, numerous background threads owned by subsystems can still be running and causing NPEs, because many objects within the plugin container have already been destroyed/nulled.

So we have an easy fix - we can use ExecutorService.awaitTermination(timeout) to wait for subsystem executor service threads to terminate. But we don't want to wait forever, so we can specify a reasonable timeout of, say, 3-5 minutes. This gives the subsystem threads a chance to finish up what they're doing and terminate, and will clean up the intermittent AS7 itest failures as well as the dozens of NPEs littering the AS7 itest log.

This change really makes sense for production too, but since we are very close to the release of RHQ 4.4 and JON 3.1, we will make the behavior configurable in the PC configuration and disable it by default (i.e. the default behavior will be the same as it has been - to *not* wait for the subsystem threads to terminate). Two new PC config props will be added:

- rhq.pc.wait-for-shutdown-service-termination (type=boolean, default=false)
- rhq.pc.shutdown-service-termination-timeout (type=long, units=seconds, default=300)
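A sketch of the awaitTermination approach described above, assuming a hypothetical subsystem class; the class, field, and parameter names are illustrative, and only the two config property names and the 300-second default come from this comment.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical PC subsystem illustrating a bounded wait for its executor
    // threads during shutdown.
    public class ExampleSubsystemManager {

        private ExecutorService executor;

        // waitForTermination would come from rhq.pc.wait-for-shutdown-service-termination,
        // timeoutSeconds from rhq.pc.shutdown-service-termination-timeout (default 300).
        public void shutdown(boolean waitForTermination, long timeoutSeconds) {
            // Stop accepting new work and interrupt running tasks.
            executor.shutdownNow();

            if (waitForTermination) {
                try {
                    // Give in-flight subsystem threads up to the configured timeout
                    // to finish before the PC tears down the objects they depend on.
                    boolean terminated = executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
                    if (!terminated) {
                        // Timed out - proceed with the rest of shutdown anyway.
                    }
                } catch (InterruptedException e) {
                    // Preserve the interrupt and keep shutting down.
                    Thread.currentThread().interrupt();
                }
            }
        }
    }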
Done in master as described in my previous comment: http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=a3d653c
It's been a while since the releases mentioned above, so activating this for production, and making a few tweaks.

master commit 0f973b4b8ef0f9f261a6bcc7563d823907fa5f44
Jay Shaughnessy <jshaughn> Wed Feb 27 12:10:54 2013 -0500

A follow-on to commit 5ef9c47ee8edab0fd0558a208911cc31dfb5d1dc. It seems this existing BZ had the same intention and actually had support built in to deal with this problem, but it had not been applied to production. So:

- Enable WAIT_FOR_SHUTDOWN_SERVICE_TERMINATION_DEFAULT such that we wait 5 minutes for executors to finish their work. This is good for operation execution, avail checking, and discovery.
- Keep isRunning(), added in the previous commit, but make it a lightweight check by taking away the lock request. Continue to use it to safeguard against ERROR generation even if discovery takes longer than 5 minutes to complete.
- Complement the outer guards in initialize() and shutdown() with guards inside the lock acquisition, to prevent the init or shutdown logic from being executed a second time in sequence.
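For illustration, a minimal sketch of the outer-guard/inner-guard pattern and the lock-free isRunning() check described in the commit message above; the class, field, and lock choices here are assumptions and do not reflect the actual PluginContainer code.

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class LifecycleGuardExample {

        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private volatile boolean started;

        // Lightweight check - reads a volatile flag, no lock acquisition.
        public boolean isRunning() {
            return started;
        }

        public void initialize() {
            if (started) {
                return; // outer guard: cheap early exit, no lock needed
            }
            lock.writeLock().lock();
            try {
                if (started) {
                    return; // inner guard: another thread initialized while we waited
                }
                // ... actual initialization work ...
                started = true;
            } finally {
                lock.writeLock().unlock();
            }
        }

        public void shutdown() {
            if (!started) {
                return; // outer guard
            }
            lock.writeLock().lock();
            try {
                if (!started) {
                    return; // inner guard: prevents running the shutdown logic twice
                }
                // ... actual shutdown work (e.g. awaitTermination on executors) ...
                started = false;
            } finally {
                lock.writeLock().unlock();
            }
        }
    }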
Bulk closing of items that are ON_QA and in old RHQ releases, which have been out for a long time and where the issue has not been re-opened since.