When the plugin container is in the process of being shut down, it is common to see multiple NPEs and other exceptions from various threads in the PC that have no idea the PC is shutting down. Instead, these threads should abort whatever they are doing as gracefully as possible, but in order to do that they need a way to check what state services are currently in. Our new Arquillian itest framework restarts the plugin container two or three times before *each test class*. This results in a ton of extra noise in the test log from the stack traces of all of the "shutdown exceptions" and makes it hard to pick out any real errors in the log.

To implement this, we'll need to add a ContainerServiceState enum with four values: INITIALIZING, INITIALIZED, SHUTTING_DOWN, SHUT_DOWN, plus one new method in the ContainerService interface:

ContainerServiceState getState();

We then need to implement state management in each of our container services, including the plugin container itself. A sketch of the proposed additions is shown below.
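A minimal sketch of what the proposed enum and interface addition might look like. The enum values and the getState() method come from the proposal above; the other ContainerService methods shown here are assumptions for illustration only and may not match the actual interface.

    // Proposed state enum (values as listed above).
    public enum ContainerServiceState {
        INITIALIZING,
        INITIALIZED,
        SHUTTING_DOWN,
        SHUT_DOWN
    }

    public interface ContainerService {
        // Assumed existing lifecycle methods, shown only for context.
        void initialize();
        void shutdown();

        // Proposed new method: lets other PC threads check whether a service
        // is still usable before touching it, so they can abort gracefully
        // while the PC is shutting down instead of throwing NPEs.
        ContainerServiceState getState();
    }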
Created attachment 574577: a patch that implements this enhancement
The crux of this issue is that, in their shutdown() methods, the various PC subsystems call shutdown() or shutdownNow() on any ExecutorServices they have, but they don't wait for all threads from those ExecutorServices to terminate before returning. So our shutdown is not graceful/clean. When PluginContainer.shutdown() returns, numerous background threads owned by subsystems can still be running and causing NPEs, because many objects within the plugin container have already been destroyed/nulled.

So we have an easy fix - we can use ExecutorService.awaitTermination(timeout) to wait for subsystem executor service threads to terminate. But we don't want to wait forever, so we can specify a reasonable timeout of, say, 3-5 minutes. This gives the subsystem threads a chance to finish up what they're doing and terminate, and will clean up the intermittent AS7 itest failures as well as the dozens of NPEs littering the AS7 itest log.

This change really makes sense for production too, but since we are very close to the release of RHQ 4.4 and JON 3.1, we will make the behavior configurable in the PC configuration and disable it by default (i.e. the default behavior will be the same as it has been - to *not* wait for the subsystem threads to terminate). Two new PC config props will be added:

- rhq.pc.wait-for-shutdown-service-termination (type=boolean, default=false)
- rhq.pc.shutdown-service-termination-timeout (type=long, units=seconds, default=300)
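A sketch of the awaitTermination approach described above, assuming a hypothetical subsystem class; the class, field, and parameter names are illustrative, and only the two config property names and the 300-second default come from this comment.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical PC subsystem illustrating a bounded wait for its executor
    // threads during shutdown.
    public class ExampleSubsystemManager {

        private ExecutorService executor;

        // waitForTermination would come from rhq.pc.wait-for-shutdown-service-termination,
        // timeoutSeconds from rhq.pc.shutdown-service-termination-timeout (default 300).
        public void shutdown(boolean waitForTermination, long timeoutSeconds) {
            // Stop accepting new work and interrupt running tasks.
            executor.shutdownNow();

            if (waitForTermination) {
                try {
                    // Give in-flight subsystem threads up to the configured timeout
                    // to finish before the PC tears down the objects they depend on.
                    boolean terminated = executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
                    if (!terminated) {
                        // Timed out - proceed with the rest of shutdown anyway.
                    }
                } catch (InterruptedException e) {
                    // Preserve the interrupt and keep shutting down.
                    Thread.currentThread().interrupt();
                }
            }
        }
    }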
Done in master as described in my previous comment: http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=a3d653c
It's been a while since the releases mentioned above, so activating this for production, and making a few tweaks.

master commit 0f973b4b8ef0f9f261a6bcc7563d823907fa5f44
Jay Shaughnessy <jshaughn> Wed Feb 27 12:10:54 2013 -0500

A follow-on to commit 5ef9c47ee8edab0fd0558a208911cc31dfb5d1dc. It seems this existing BZ had the same intention and actually had support built in to deal with this problem, but it had not been applied to production. So:

- Enable WAIT_FOR_SHUTDOWN_SERVICE_TERMINATION_DEFAULT such that we wait 5 minutes for executors to finish their work. This is good for operation execution, avail checking, and discovery.
- Keep isRunning(), added in the previous commit, but make it a lightweight check by taking away the lock request. Continue to use it to safeguard against ERROR generation even if discovery takes longer than 5 minutes to complete.
- Complement the outer guards in initialize() and shutdown() with guards inside the lock acquisition, to prevent the init or shutdown logic from being executed a second time in sequence.
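For illustration, a minimal sketch of the outer-guard/inner-guard pattern and the lock-free isRunning() check described in the commit message above; the class, field, and lock choices here are assumptions and do not reflect the actual PluginContainer code.

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class LifecycleGuardExample {

        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private volatile boolean started;

        // Lightweight check - reads a volatile flag, no lock acquisition.
        public boolean isRunning() {
            return started;
        }

        public void initialize() {
            if (started) {
                return; // outer guard: cheap early exit, no lock needed
            }
            lock.writeLock().lock();
            try {
                if (started) {
                    return; // inner guard: another thread initialized while we waited
                }
                // ... actual initialization work ...
                started = true;
            } finally {
                lock.writeLock().unlock();
            }
        }

        public void shutdown() {
            if (!started) {
                return; // outer guard
            }
            lock.writeLock().lock();
            try {
                if (!started) {
                    return; // inner guard: prevents running the shutdown logic twice
                }
                // ... actual shutdown work (e.g. awaitTermination on executors) ...
                started = false;
            } finally {
                lock.writeLock().unlock();
            }
        }
    }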
Bulk closing of items that are ON_QA and in old RHQ releases, which have been out for a long time and where the issue has not been re-opened since.