Bug 999183 - [watchman] ArgumentError when parsing date
[watchman] ArgumentError when parsing date
Product: OpenShift Online
Classification: Red Hat
Component: Containers (Show other bugs)
x86_64 Linux
unspecified Severity high
: ---
: ---
Assigned To: Michal Fojtik
libra bugs
Depends On:
  Show dependency treegraph
Reported: 2013-08-20 17:44 EDT by Kenny Woodson
Modified: 2015-05-14 19:26 EDT (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-09-19 12:47:32 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)

  None (edit)
Description Kenny Woodson 2013-08-20 17:44:30 EDT
Description of problem:

Lately the importance of watchman has increased due to its handling of throttling.  Ops has added monitoring for this service and therefore we have seen, in one week, 750 restarts for the watchman service.

This is what we see in the logs:
Aug 20 16:14:40 ex-std-node148 rhc-watchman[31800]: Running rhc-watchman => delay: 60s, exception threshold: 5
Aug 20 16:14:41 ex-std-node148 rhc-watchman[31800]: watchman caught #<ArgumentError: invalid date>: invalid date. Retries left: 4

It appears there is an invalid date that is being parsed over and over which causes watchman to fail after 10 exceptions.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a date that is unreadable for rhc-watchman
2. View the exception/retry loop in /var/log/messages

Actual results:
rhc-watchman dies after 10 exceptions are caught.

Expected results:
rhc-watchman should attempt to parse the incorrect date.  If it fails after a number of attempts it should correct the date.  If it is unable to do so, print an exception to syslog and continue processing.

Additional info:

This is happening on a high number of our servers.  This needs to be repaired/fixed/patched.
Comment 1 Kenny Woodson 2013-08-21 10:04:39 EDT
Here is the specific line that is killing the parser:
2013/02013/05/13 12:24:45,313 ERROR [stderr] (Thread-0 (HornetQ-scheduled-threads-33041257)) Exception in thread "Thread-0 (HornetQ-scheduled-threads-33041257)" java.lang.OutOfMemoryError: unable to create new native thread

This happens on the strptime call:
File.open(server_log).grep(/ java.lang.OutOfMemoryError/) {|msg|
      Syslog.info("MSG => #{msg}") #DEBUG
      # timezones are just a PITA. server.log message doesn't include timezone or date so inject both from today
      ts = DateTime.strptime(msg, '%Y/%m/%d %T')
      timestamp = DateTime.civil(ts.year, ts.month, ts.day, ts.hour, ts.min, ts.sec, epoch.zone)
      next if epoch > timestamp.

      cache_incident(uuid, timestamp, app_env)
Comment 2 Kenny Woodson 2013-08-21 10:27:22 EDT
The file that is causing the problem:

Comment 3 Michal Fojtik 2013-08-21 10:39:49 EDT
I have fix which make watchman ignore the invalid timestamps with a warning dumped to syslog:

Comment 4 Kenny Woodson 2013-08-30 11:03:47 EDT
Aug 30 10:11:44 ex-std-node74 rhc-watchman[15274]: watchman caught #<ArgumentError: invalid date>: invalid date. Retries left: 0

We are still seeing this issue in production on a few of our nodes.  I added some debugging inside of watchman to see where it was failing.  I see its failing to parse the java server.log

Aug 30 10:51:54 ex-std-node148 rhc-watchman[25453]: PROBLEM PARSING: /var/lib/openshift/518809f74382ec8dfb000015/jbossas//logs/server.log 

Here is the actual line it is failing on.

2013/02013/05/13 12:24:45,313 ERROR [stderr] (Thread-0 (HornetQ-scheduled-threads-33041257)) Exception in thread "Thread-0 (HornetQ-scheduled-threads-33041257)" java.lang.OutOfMemoryError: unable to create new native thread

./rhc-watchman:156:in `strptime'
Comment 5 openshift-github-bot 2013-08-30 21:42:14 EDT
Commit pushed to master at https://github.com/openshift/li

Bug 999183 - Make rhc-watchman ignore messages with invalid timestamp
Comment 6 Meng Bo 2013-09-05 06:14:54 EDT
Checked on devenv_3745,

With about 50 jbossas apps on node.
Modify the timestamp for the server.log to some invalid value. Like the comment#4.
Restart the gear.

Watch the /var/log/messages
There is no date related error, and the rhc-watchman did not restart.

Move bug to verified.

Note You need to log in before you can comment on or make changes to this bug.