Bug 1006619 - Log events sources are unstable and events are lost due to Log4j log parsing not thread safe
Log events sources are unstable and events are lost due to Log4j log parsing ...
Status: CLOSED CURRENTRELEASE
Product: JBoss Operations Network
Classification: JBoss
Component: Monitoring - Events (Show other bugs)
JON 3.1.2
Unspecified Unspecified
high Severity urgent
: ER01
: JON 3.2.0
Assigned To: John Sanda
Mike Foley
:
Depends On: 846082
Blocks: 1015715
  Show dependency treegraph
 
Reported: 2013-09-10 19:16 EDT by Larry O'Leary
Modified: 2014-01-02 15:38 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1015715 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
All events from the log files are reported on the web. (183.31 KB, image/png)
2013-10-14 13:45 EDT, jvlasak
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 446653 None None None Never

  None (edit)
Description Larry O'Leary 2013-09-10 19:16:14 EDT
Description of problem:
Even though log messages are logged to a resource's log file and events had originally been reported for that source, from time to time events are simply missed and can results in loss of event history and alert failures. This requires log monitoring to be performed outside of JBoss ON to verify JBoss ON is working.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

It is not clear how or why this issue occurs. In the reported case, a log message was written to the log file and after several minutes, the log event was never triggered. This is not always the case. Sometimes the log event generates the expected result but not always.

The original working theory was related to disk caching and log file rolling. However, if appears that even after a log file has been written (verified by tailing the log file at the command-line) the agent just doesn't detect the log being updated.

Further research revealed that bug 846082 had been logged sometime ago that may explain this sporadic behavior.
Comment 1 John Sanda 2013-09-13 14:09:44 EDT
I created a test to reproduce the issues described in bug 846082. I reverted the changes in Log4JLogEntryProcessor, making the DateFormat fields static again. The test consistently fails. If you make them instance fields, the test consistently passes.

I did my work in the branch bug/1006619 which has been pushed to origin,

https://git.fedorahosted.org/cgit/rhq/rhq.git/log/?h=bug/1006619

For reference, here is what some of the exceptions look like:

java.lang.NumberFormatException: For input string: "E.423021313E4"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
	at java.lang.Double.parseDouble(Double.java:540)
	at java.text.DigitList.getDouble(DigitList.java:168)
	at java.text.DecimalFormat.parse(DecimalFormat.java:1321)
	at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2088)
	at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1455)
	at java.text.DateFormat.parse(DateFormat.java:355)
	at org.rhq.core.pluginapi.event.log.Log4JLogEntryProcessor.parseDateString(Log4JLogEntryProcessor.java:117)
	at org.rhq.core.pluginapi.event.log.Log4JLogEntryProcessor.processPrimaryLine(Log4JLogEntryProcessor.java:94)
	at org.rhq.core.pluginapi.event.log.MultiLineLogEntryProcessor.processLine(MultiLineLogEntryProcessor.java:98)
	at org.rhq.core.pluginapi.event.log.MultiLineLogEntryProcessor.processLines(MultiLineLogEntryProcessor.java:70)
	at org.rhq.core.pluginapi.event.log.Log4JEventsTest$Producer.run(Log4JEventsTest.java:46)


Keep in mind that due to the lack of exception handling in 3.1.2, these errors would go completely unreported. The changes for bug 846082 adds exception handling that captures any RuntimeExceptions.
Comment 2 Heiko W. Rupp 2013-09-24 16:46:42 EDT
Setting ON_QA

[22:44:25] <loleary>	 Well, 6619 is actually fixed by 846082... 6619 can go to ON_QA as it was already fixed (jsanda can confirm) in ER01.

[22:45:58] <loleary>	 The only reason 9666 related to 6619 was because it is preventing one from actually testing whether 6619 is fixed or not.
Comment 3 jvlasak 2013-10-09 09:00:43 EDT
I found bug 1017214. What should be done with this bug? Thanks
Comment 4 Larry O'Leary 2013-10-09 09:28:20 EDT
I do not think bug 1017214 is related so I will remove from the dependency list.
Comment 5 jvlasak 2013-10-14 13:45:48 EDT
Created attachment 812135 [details]
All events from the log files are reported on the web.

Verified. 
Tested with 15 log files when messages were simultaneously generated into files using a script. After processes generating the messages were stopped. Then number of events (considered levels: INFO, WARN, ERROR, FATAL) on the web and in all the log files was same.
See the attached screenshot.

Note You need to log in before you can comment on or make changes to this bug.