Bug 949065 - NPE when events published before plugin container fully starts
Summary: NPE when events published before plugin container fully starts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: RHQ Project
Classification: Other
Component: Agent
Version: 4.5
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: RHQ 4.8
Assignee: Heiko W. Rupp
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-04-05 19:01 UTC by Elias Ross
Modified: 2014-12-22 13:38 UTC (History)
4 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2013-09-11 09:53:01 UTC
Embargoed:


Attachments (Terms of Use)
Proposed patch (7.00 KB, patch)
2013-06-19 09:26 UTC, Heiko W. Rupp
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 991149 0 unspecified CLOSED Refactor Plugin Container Services initialization 2021-02-22 00:41:40 UTC

Internal Links: 991149

Description Elias Ross 2013-04-05 19:01:34 UTC
Description of problem:


2013-04-05 18:56:45,757 ERROR [AvailabilityCollector-1] (EventManager)- Failed to add Events for Resource[id=92072, uuid=3b30ff13-e3b7-4d52-911f-26eb5cd11d63, type={verdad}Monitor, key=check_nfs_mounts, name=check_nfs_mounts, parent=Verdad] to Event report: [Event[id=0, source=null, timestamp=1365188145753, severity=ERROR, detail=NFS mounts missing: vp25q03ad-filer002-sec:/vol/vp2_share vp25q03ad-filer006:/vol/poc15 Found:  (check /etc/fstab)
]]
java.lang.NullPointerException
	at org.rhq.core.pc.event.EventManager.publishEvents(EventManager.java:112)
	at org.rhq.core.pc.event.EventContextImpl.publishEvent(EventContextImpl.java:67)

modules/core/plugin-container/src/main/java/org/rhq/core/pc/event/EventManager.java

    void publishEvents(@NotNull Set<Event> events, @NotNull Resource resource) {
        this.reportLock.readLock().lock();
        try {
            for (Event event : events) {
                EventSource eventSource = createEventSource(event, resource);
                this.activeReport.addEvent(event, eventSource);
^^^^^ I'm guessing 'activeReport' is null at this point
            }
        } catch (Throwable t) {
            log.error("Failed to add Events for " + resource + " to Event report: " + events, t);
        } finally {
            this.reportLock.readLock().unlock();
        }
    }

If a plugin takes time to start up, e.g. like the network adapters, then EventManager.initialize() has not been called yet.

Perhaps initialize() should be replaced with a proper constructor?

Comment 1 Jay Shaughnessy 2013-06-17 13:20:56 UTC
Hi Elias,

Could you post more of the stack?  Is this originating in the IRC plugin?

Comment 2 John Mazzitelli 2013-06-17 13:31:12 UTC
the easiest thing to do would be to just make a null check and skip the event if its null. If the agent/PC is still initializing/starting, then it seems we aren't ready to process events yet. We should ignore it and skip.

Comment 3 Elias Ross 2013-06-17 17:21:48 UTC
Dropping events means possibly dropping something important, so I'm not happy with the null check.

I don't have a full stack trace. But that's my component (started) publishing events.

I have a proper fix for this issue plus other race conditions in a patch, which is pending code review at Apple.

Comment 4 John Mazzitelli 2013-06-17 17:32:26 UTC
(In reply to Elias Ross from comment #3)
> Dropping events means possibly dropping something important, so I'm not
> happy with the null check.
> 
> I don't have a full stack trace. But that's my component (started)
> publishing events.

But if your agent was down (which it was, because you are starting it up in this scenario), you were already losing events. While the agent is starting up, its still in that same state in the sense you are losing events. When the agent finishes starting up, you get your events.

Curious to see the patch though.

Comment 5 Elias Ross 2013-06-17 20:50:38 UTC
You are kind of right. If the agent is down for a long time, events are lost.

The events I get are SNMP notifications, which are 'transacted' in the sense that the client will attempt to resend if the SNMP trap handler is down.

I also use events for other things. For example, if an availability check yields a warning or error, I create an event. This provides more information than 'up' or 'down'. I want those to be processed.

The agent going up or down is fairly rare, so it's not a huge deal, though.

Comment 6 Heiko W. Rupp 2013-06-19 08:27:04 UTC
I can now see how this can happen:

PluginContainer.initialize() does


-->            startContainerService(inventoryManager);
            startContainerService(measurementManager);
            startContainerService(configurationManager);
            startContainerService(operationManager);
            startContainerService(resourceFactoryManager);
            startContainerService(contentManager);
-->            startContainerService(eventManager);


The call to startContainerService(inventoryManager) tickles down and effectively calls 
component.start(context);

Now ResourceComponent.start() is meant to set up the EventContext inside the plugin, which normally means that the plugin starts a thread waiting for events etc.

Now when activating subsequent resources takes a long time (e.g. because connection setup to a remote is slow) or starting the other container services before the EventManager takes longer than it takes for the first event to arrive, then activeRecord can indeed be null.

Comment 7 Heiko W. Rupp 2013-06-19 09:26:39 UTC
Created attachment 762814 [details]
Proposed patch

Comment 8 Heiko W. Rupp 2013-06-19 13:04:37 UTC
Mazz reviewed my change

master 5dbd4d2

Comment 9 Heiko W. Rupp 2013-06-19 13:05:20 UTC
Thanks, Elias

Comment 10 Elias Ross 2013-06-20 02:13:30 UTC
I still have some structural changes I'd like to get in, and perhaps this belongs more in a new bug than one focused on this one bug.

Each component class in the plugin container relies on the initialization order to happen correctly. The proper fix is to use constructors (remove 'start' methods) and pass in the components that each component depends on. By removing calls to PluginContainer.getInstance().getXXX(), it avoids many potential race conditions. It also clearly identifies which components depend on which other components.

Anyway, I have a patch pending, so I will create a new bug once the patch gets approval.

Comment 11 Heiko W. Rupp 2013-06-20 08:08:17 UTC
Elias,
yes please open a new BZ - I have seen the potential to move that to constructors, but wanted to keep this change as small as possible - especially as we try to get out RHQ 4.8 out very soon. 

Thanks again.

Comment 12 Elias Ross 2013-08-01 18:29:25 UTC
The new BZ is 991149 for the refactoring.

Comment 13 Heiko W. Rupp 2013-09-11 09:53:01 UTC
Bulk closing of old issues now that HRQ 4.9 is in front of the door.

If you think the issue has not been solved, then please open a new bug and mention this one in the description.


Note You need to log in before you can comment on or make changes to this bug.