Red Hat Bugzilla – Bug 1006619
Log event sources are unstable and events are lost because Log4j log parsing is not thread-safe
Last modified: 2014-01-02 15:38:28 EST
Description of problem:
Even though log messages are written to a resource's log file and events had originally been reported for that source, from time to time events are simply missed, which can result in loss of event history and alert failures. As a consequence, log monitoring has to be performed outside of JBoss ON to verify that JBoss ON is working.
Version-Release number of selected component (if applicable):
It is not clear how or why this issue occurs. In the reported case, a log message was written to the log file and, even after several minutes, the log event had not been triggered. The failure is intermittent: sometimes the log event is generated as expected, sometimes it is not.
The original working theory was related to disk caching and log file rolling. However, it appears that even after a log file has been written (verified by tailing the log file at the command line), the agent just doesn't detect that the log has been updated.
Further research revealed that bug 846082, logged some time ago, may explain this sporadic behavior.
I created a test to reproduce the issues described in bug 846082. I reverted the changes in Log4JLogEntryProcessor, making the DateFormat fields static again. The test consistently fails. If you make them instance fields, the test consistently passes.
I did my work in the branch bug/1006619, which has been pushed to origin.
For reference, here is what some of the exceptions look like:
java.lang.NumberFormatException: For input string: "E.423021313E4"
Keep in mind that due to the lack of exception handling in 3.1.2, these errors would go completely unreported. The changes for bug 846082 add exception handling that captures any RuntimeException.
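To illustrate the static-versus-instance difference described above, here is a minimal sketch. It is not the actual Log4JLogEntryProcessor code: the class names, the timestamp pattern, and the sample value are assumptions made for illustration. Each hypothetical processor owns its own SimpleDateFormat, so concurrent parsing never shares the formatter's mutable internal state.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DateFormatThreadSafetySketch {

    // Hypothetical stand-in for a log entry processor; not the real RHQ class.
    static class EntryProcessor {
        // Instance field: one SimpleDateFormat per processor. Declaring this
        // field static would share a single non-thread-safe formatter across
        // all threads, which is the failure mode described in this bug.
        private final SimpleDateFormat dateFormat =
                new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");

        long parseTimestamp(String text) throws ParseException {
            return dateFormat.parse(text).getTime();
        }
    }

    // Parses the same timestamp from several threads, one processor per
    // thread, and returns how many parses failed or gave a wrong value.
    static int countFailures(int threads, int iterationsPerThread) throws Exception {
        final String sample = "2013-09-10 22:44:25,123"; // assumed log timestamp
        final long expected = new EntryProcessor().parseTimestamp(sample);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> results = new ArrayList<>();
        try {
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    EntryProcessor processor = new EntryProcessor();
                    int failures = 0;
                    for (int i = 0; i < iterationsPerThread; i++) {
                        try {
                            if (processor.parseTimestamp(sample) != expected) {
                                failures++;
                            }
                        } catch (ParseException | RuntimeException e) {
                            // Count the error instead of losing it silently.
                            failures++;
                        }
                    }
                    return failures;
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("failures: " + countFailures(8, 2000));
    }
}
```

With per-instance formatters the failure count should stay at zero. Switching the field to static is expected to produce intermittent wrong values or exceptions such as the NumberFormatException shown above, although a race like that is not guaranteed to show up on every run.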
[22:44:25] <loleary> Well, 6619 is actually fixed by 846082... 6619 can go to ON_QA as it was already fixed (jsanda can confirm) in ER01.
[22:45:58] <loleary> The only reason 9666 related to 6619 was because it is preventing one from actually testing whether 6619 is fixed or not.
I found bug 1017214. What should be done with this bug? Thanks
I do not think bug 1017214 is related, so I will remove it from the dependency list.
Created attachment 812135
All events from the log files are reported on the web.
Tested with 15 log files, with messages generated into the files simultaneously by a script. After the processes generating the messages were stopped, the number of events (levels considered: INFO, WARN, ERROR, FATAL) on the web and in all the log files was the same.
See the attached screenshot.