Description of problem: 2013-04-05 18:56:45,757 ERROR [AvailabilityCollector-1] (EventManager)- Failed to add Events for Resource[id=92072, uuid=3b30ff13-e3b7-4d52-911f-26eb5cd11d63, type={verdad}Monitor, key=check_nfs_mounts, name=check_nfs_mounts, parent=Verdad] to Event report: [Event[id=0, source=null, timestamp=1365188145753, severity=ERROR, detail=NFS mounts missing: vp25q03ad-filer002-sec:/vol/vp2_share vp25q03ad-filer006:/vol/poc15 Found: (check /etc/fstab) ]] java.lang.NullPointerException at org.rhq.core.pc.event.EventManager.publishEvents(EventManager.java:112) at org.rhq.core.pc.event.EventContextImpl.publishEvent(EventContextImpl.java:67) modules/core/plugin-container/src/main/java/org/rhq/core/pc/event/EventManager.java void publishEvents(@NotNull Set<Event> events, @NotNull Resource resource) { this.reportLock.readLock().lock(); try { for (Event event : events) { EventSource eventSource = createEventSource(event, resource); this.activeReport.addEvent(event, eventSource); ^^^^^ I'm guessing 'activeReport' is null at this point } } catch (Throwable t) { log.error("Failed to add Events for " + resource + " to Event report: " + events, t); } finally { this.reportLock.readLock().unlock(); } } If a plugin takes time to start up, e.g. like the network adapters, then EventManager.initialize() has not been called yet. Perhaps initialize() should be replaced with a proper constructor?
Hi Elias, Could you post more of the stack? Is this originating in the IRC plugin?
the easiest thing to do would be to just make a null check and skip the event if its null. If the agent/PC is still initializing/starting, then it seems we aren't ready to process events yet. We should ignore it and skip.
Dropping events means possibly dropping something important, so I'm not happy with the null check. I don't have a full stack trace. But that's my component (started) publishing events. I have a proper fix for this issue plus other race conditions in a patch, which is pending code review at Apple.
(In reply to Elias Ross from comment #3) > Dropping events means possibly dropping something important, so I'm not > happy with the null check. > > I don't have a full stack trace. But that's my component (started) > publishing events. But if your agent was down (which it was, because you are starting it up in this scenario), you were already losing events. While the agent is starting up, its still in that same state in the sense you are losing events. When the agent finishes starting up, you get your events. Curious to see the patch though.
You are kind of right. If the agent is down for a long time, events are lost. The events I get are SNMP notifications, which are 'transacted' in the sense that the client will attempt to resend if the SNMP trap handler is down. I also use events for other things. For example, if an availability check yields a warning or error, I create an event. This provides more information than 'up' or 'down'. I want those to be processed. The agent going up or down is fairly rare, so it's not a huge deal, though.
I can now see how this can happen: PluginContainer.initialize() does --> startContainerService(inventoryManager); startContainerService(measurementManager); startContainerService(configurationManager); startContainerService(operationManager); startContainerService(resourceFactoryManager); startContainerService(contentManager); --> startContainerService(eventManager); The call to startContainerService(inventoryManager) tickles down and effectively calls component.start(context); Now ResourceComponent.start() is meant to set up the EventContext inside the plugin, which normally means that the plugin starts a thread waiting for events etc. Now when activating subsequent resources takes a long time (e.g. because connection setup to a remote is slow) or starting the other container services before the EventManager takes longer than it takes for the first event to arrive, then activeRecord can indeed be null.
Created attachment 762814 [details] Proposed patch
Mazz reviewed my change master 5dbd4d2
Thanks, Elias
I still have some structural changes I'd like to get in, and perhaps this belongs more in a new bug than one focused on this one bug. Each component class in the plugin container relies on the initialization order to happen correctly. The proper fix is to use constructors (remove 'start' methods) and pass in the components that each component depends on. By removing calls to PluginContainer.getInstance().getXXX(), it avoids many potential race conditions. It also clearly identifies which components depend on which other components. Anyway, I have a patch pending, so I will create a new bug once the patch gets approval.
Elias, yes please open a new BZ - I have seen the potential to move that to constructors, but wanted to keep this change as small as possible - especially as we try to get out RHQ 4.8 out very soon. Thanks again.
The new BZ is 991149 for the refactoring.
Bulk closing of old issues now that HRQ 4.9 is in front of the door. If you think the issue has not been solved, then please open a new bug and mention this one in the description.