Bug 1491405 - Fluentd logs filled with long lines of backslashes and undefined method utc errors after updating docker and using json-file as log driver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Rich Megginson
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-13 17:16 UTC by Walid A.
Modified: 2017-11-28 22:10 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The json-file parser assumed the "time" field was a Time object rather than a String, and String does not have a "utc" method. Consequence: The fluentd logs fill with the error. Fix: Check the type of the "time" field and convert a String to a Time object if necessary. Result: Time values read from json-file logs are parsed correctly and no errors are logged.
Clone Of:
Environment:
Last Closed: 2017-11-28 22:10:58 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Walid A. 2017-09-13 17:16:34 UTC
Description of problem:

This is seen on an AWS OCP 3.6 cluster.  While testing single fluentd (non-mux) logging throughput with json-file as the log driver, we hit the issue described in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1478821, and no pod log messages were getting indexed in Elasticsearch.  After applying the workaround for BZ 1478821 (removing "/var/lib/docker/containers/*"), upgrading docker to docker-2:1.12.6-52.gitc4618fb, and redeploying the logging stack with the byo/openshift-cluster/openshift-logging.yml playbook, we saw that the node fluentd pod logs were filled with long lines of backslashes and errors:

\\\"message\\\"=>\\\"2017-09-13 16:13:25 +0000 [warn]: dump an error event: error_class=NoMethodError error=\\\\\\\"undefined method `utc' for \\\\\\\\\\\\\\\"2017-09-13T16:13:19.521646195Z\\\\\\\\\\\\\\\":String\\\\\\\" tag=\\\\\\\"kubernetes.var.log.containers.logging-fluentd-t4wll_logging_fluentd-elasticsearch-8e9f0750c7b6331028cc8cc91094111f89b10d34dcb01e29c22e96938f1b2c09.log\\\\\\\" time=1505319199 record={\\\\\\\"log\\\\\\\"=>\\\\\\\"\\\\\\\\ ....


Version-Release number of selected component (if applicable):
logging images: v3.6.173.0.32

# openshift version
openshift v3.6.173.0.27
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

docker package version:  docker-1.12.6-55.gitc4618fb.el7.x86_64

How reproducible: always



Steps to Reproduce:
1. Create an OCP 3.6.173.0.x cluster with 1 master-etcd node, 1 infra node, and 2 compute nodes (m4.xlarge instances)
2. Edit /etc/sysconfig/docker to use json-file as the log driver on all the nodes:
# /etc/sysconfig/docker

# Modify these options if you want to change the way the docker daemon runs
## OPTIONS=' --selinux-enabled  --log-driver=journald'
OPTIONS=' --selinux-enabled  --log-driver=json-file --log-opt max-size=100M --log-opt max-file=50'
if [ -z "${DOCKER_CERT_PATH}" ]; then
    DOCKER_CERT_PATH=/etc/docker
fi

# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# from the atomic-registries package.
#

# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest
ADD_REGISTRY='--add-registry registry.ops.openshift.com --add-registry registry.access.redhat.com'
INSECURE_REGISTRY='--insecure-registry registry.ops.openshift.com'

3. restart docker and atomic-openshift-node
4. update docker to package version docker-1.12.6-55.gitc4618fb.el7.x86_64
5. restart docker and atomic-openshift-node
6. delete the existing logging project
7. re-deploy the logging stack with the attached inventory using the byo/openshift-cluster/openshift-logging.yml playbook
8. run the logtest tool (tool used by QE: https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md) to generate logs from a CentOS pod on a compute node
9. check the fluentd logs running on the compute nodes with oc logs

Links:
inventory used to deploy logging :  https://gist.github.com/wabouhamad/c96f9b243b4f03261295a133cd6feb5c

deployment config for elasticsearch: https://gist.github.com/wabouhamad/2b837b6046c2356b75baf604c16433af

daemonset config for logging-fluentd:
https://gist.github.com/wabouhamad/66d7845a2cdf18242097c2fd853477c6

Ansible playbook to tune journald: https://gist.github.com/wabouhamad/e0c2cb8937ecbf15870b34f3d669a8f3

Actual results:
Fluentd pod logs were filled with long lines of backslashes and errors:

\\\"message\\\"=>\\\"2017-09-13 16:13:25 +0000 [warn]: dump an error event: error_class=NoMethodError error=\\\\\\\"undefined method `utc'

Expected results:
No errors in the fluentd logs, and the pod logs generated by the logtest tool are indexed in Elasticsearch

Additional info:
Journald was tuned with this playbook:
https://gist.github.com/wabouhamad/e0c2cb8937ecbf15870b34f3d669a8f3

Comment 2 Rich Megginson 2017-09-13 17:28:50 UTC
fixed in https://github.com/openshift/origin-aggregated-logging/pull/657
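
For reference, a minimal sketch of the approach the Doc Text describes (check the type of the "time" field and convert a String to a Time object if necessary). This is illustrative only, not the actual patch from the PR above; the helper name normalize_time is hypothetical:

require 'time'

# Hypothetical helper: accept the "time" value as either a Time object or an
# ISO 8601 String and always return a Time, so callers can safely invoke #utc.
def normalize_time(value)
  value = Time.parse(value) if value.is_a?(String)
  value.utc
end

normalize_time("2017-09-13T16:13:19.521646195Z")  # => Time, no NoMethodError
normalize_time(Time.now)                          # => Time, unchanged behavior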

Comment 3 Rich Megginson 2017-09-14 21:05:18 UTC
Fixed for 3.6.2:

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=596941
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-10286-20170914205115
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.33
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.33-2
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6

Comment 4 Walid A. 2017-09-16 03:47:08 UTC
Verified on an AWS EC2 OCP cluster:
# openshift version
openshift v3.6.173.0.30
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

openshift_logging_image_version=v3.6.173.0.32

logging-fluentd image v3.6.173.0.33-2

Ran the logging test at 750 messages per second from a single pod and verified messages were indexed successfully in Elasticsearch.

No errors in the fluentd or Elasticsearch pod logs.

No rejected bulk queue errors during the test.

# oc exec $POD -n logging -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/indices?v
health status index                                                                      pri rep docs.count docs.deleted store.size pri.store.size 
green  open   .searchguard.logging-es-data-master-ok4s3o6y                                 1   0          5            0     30.4kb         30.4kb 
green  open   project.logging.0d3b9ba5-9a7a-11e7-b958-020518006434.2017.09.16              1   0        949            0    420.4kb        420.4kb 
green  open   project.lograte-750mpsec-0.33590427-9a89-11e7-b1fa-020518006434.2017.09.16   1   0     337125            0    137.9mb        137.9mb 
green  open   project.lograte-100mpsec-0.eb21deed-9a87-11e7-b1fa-020518006434.2017.09.16   1   0      28000            0     11.5mb         11.5mb 
green  open   .kibana                                                                      1   0          1            0      3.1kb          3.1kb 
green  open   .operations.2017.09.16                                                       1   0          4            0     17.6kb         17.6kb 


# oc exec -n logging $POD --   curl -s -k --cert /etc/elasticsearch/secret/admin-cert   --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize
host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize 
172.20.0.7            917             0          0           0            200

Comment 8 errata-xmlrpc 2017-11-28 22:10:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

