Description of problem: Lately the importance of watchman has increased due to its handling of throttling. Ops has added monitoring for this service and therefore we have seen, in one week, 750 restarts for the watchman service. This is what we see in the logs: Aug 20 16:14:40 ex-std-node148 rhc-watchman[31800]: Running rhc-watchman => delay: 60s, exception threshold: 5 Aug 20 16:14:41 ex-std-node148 rhc-watchman[31800]: watchman caught #<ArgumentError: invalid date>: invalid date. Retries left: 4 It appears there is an invalid date that is being parsed over and over which causes watchman to fail after 10 exceptions. Version-Release number of selected component (if applicable): Current How reproducible: Very. Steps to Reproduce: 1. Create a date that is unreadable for rhc-watchman 2. View the exception/retry loop in /var/log/messages 3. Actual results: rhc-watchman dies after 10 exceptions are caught. Expected results: rhc-watchman should attempt to parse the incorrect date. If it fails after a number of attempts it should correct the date. If it is unable to do so, print an exception to syslog and continue processing. Additional info: This is happening on a high number of our servers. This needs to be repaired/fixed/patched.
Here is the specific line that is killing the parser: --------- 2013/02013/05/13 12:24:45,313 ERROR [stderr] (Thread-0 (HornetQ-scheduled-threads-33041257)) Exception in thread "Thread-0 (HornetQ-scheduled-threads-33041257)" java.lang.OutOfMemoryError: unable to create new native thread --------- This happens on the strptime call: ------------ File.open(server_log).grep(/ java.lang.OutOfMemoryError/) {|msg| Syslog.info("MSG => #{msg}") #DEBUG # timezones are just a PITA. server.log message doesn't include timezone or date so inject both from today ts = DateTime.strptime(msg, '%Y/%m/%d %T') timestamp = DateTime.civil(ts.year, ts.month, ts.day, ts.hour, ts.min, ts.sec, epoch.zone) next if epoch > timestamp. cache_incident(uuid, timestamp, app_env) } -------------
The file that is causing the problem: /var/lib/openshift/518809f74382ec8dfb000015/jbossas//logs/server.log
I have fix which make watchman ignore the invalid timestamps with a warning dumped to syslog: https://github.com/openshift/li/pull/1843
Aug 30 10:11:44 ex-std-node74 rhc-watchman[15274]: watchman caught #<ArgumentError: invalid date>: invalid date. Retries left: 0 We are still seeing this issue in production on a few of our nodes. I added some debugging inside of watchman to see where it was failing. I see its failing to parse the java server.log Aug 30 10:51:54 ex-std-node148 rhc-watchman[25453]: PROBLEM PARSING: /var/lib/openshift/518809f74382ec8dfb000015/jbossas//logs/server.log Here is the actual line it is failing on. 2013/02013/05/13 12:24:45,313 ERROR [stderr] (Thread-0 (HornetQ-scheduled-threads-33041257)) Exception in thread "Thread-0 (HornetQ-scheduled-threads-33041257)" java.lang.OutOfMemoryError: unable to create new native thread ./rhc-watchman:156:in `strptime'
Commit pushed to master at https://github.com/openshift/li https://github.com/openshift/li/commit/bac7a1dfb4af4765815aadaf51c9a134c8ab3575 Bug 999183 - Make rhc-watchman ignore messages with invalid timestamp
Checked on devenv_3745, With about 50 jbossas apps on node. Modify the timestamp for the server.log to some invalid value. Like the comment#4. Restart the gear. Watch the /var/log/messages There is no date related error, and the rhc-watchman did not restart. Move bug to verified.