Description of problem:
This is seen on an AWS OCP 3.6 cluster. While testing single-fluentd (non-mux) logging throughput with json-file as the log driver, we hit the issue described in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1478821, and no pod log messages were getting indexed in Elasticsearch. After applying the workaround for BZ 1478821 (removing "/var/lib/docker/containers/*"), upgrading docker to docker-2:1.12.6-52.gitc4618fb, and redeploying the logging stack with the byo/openshift-cluster/openshift-logging.yml playbook, the node fluentd pod logs were filled with long lines of backslashes and errors:

\\\"message\\\"=>\\\"2017-09-13 16:13:25 +0000 [warn]: dump an error event: error_class=NoMethodError error=\\\\\\\"undefined method `utc' for \\\\\\\\\\\\\\\"2017-09-13T16:13:19.521646195Z\\\\\\\\\\\\\\\":String\\\\\\\" tag=\\\\\\\"kubernetes.var.log.containers.logging-fluentd-t4wll_logging_fluentd-elasticsearch-8e9f0750c7b6331028cc8cc91094111f89b10d34dcb01e29c22e96938f1b2c09.log\\\\\\\" time=1505319199 record={\\\\\\\"log\\\\\\\"=>\\\\\\\"\\\\\\\\ ....

Version-Release number of selected component (if applicable):
logging images: v3.6.173.0.32

# openshift version
openshift v3.6.173.0.27
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

docker package version: docker-1.12.6-55.gitc4618fb.el7.x86_64

How reproducible:
Always, when /etc/sysconfig/docker is configured to use json-file as the log driver.

Steps to Reproduce:
1. Create an OCP 3.6.173.0.x cluster of 1 master-etcd, 1 infra node, 2 compute nodes, m4.xlarge
2. Edit /etc/sysconfig/docker to use json-file as the log driver on all the nodes:

# /etc/sysconfig/docker

# Modify these options if you want to change the way the docker daemon runs
## OPTIONS=' --selinux-enabled --log-driver=journald'
OPTIONS=' --selinux-enabled --log-driver=json-file --log-opt max-size=100M --log-opt max-file=50'

if [ -z "${DOCKER_CERT_PATH}" ]; then
    DOCKER_CERT_PATH=/etc/docker
fi

# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# from the atomic-registries package.
#
# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest

ADD_REGISTRY='--add-registry registry.ops.openshift.com --add-registry registry.access.redhat.com'
INSECURE_REGISTRY='--insecure-registry registry.ops.openshift.com'

3. Restart docker and atomic-openshift-node
4. Update docker to package version docker-1.12.6-55.gitc4618fb.el7.x86_64
5. Restart docker and atomic-openshift-node
6. Delete the existing logging project
7. Re-deploy the logging stack with the inventory (attached) using the byo/openshift-cluster/openshift-logging.yml playbook (a rough command sketch for steps 3-7 follows this list)
8. Run the logtest tool (tool used by QE: https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md) to generate logs from a CentOS pod on a compute node
9. Check the fluentd logs running on the compute nodes with oc logs
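For convenience, a rough sketch of the shell commands behind steps 3-7. The package version and project name are taken from this report; the openshift-ansible playbook path assumes an RPM install and may differ per environment, and the inventory path is a placeholder.

# On each node: restart services after editing /etc/sysconfig/docker (step 3)
systemctl restart docker atomic-openshift-node

# On each node: update docker to the version under test (step 4), then restart again (step 5)
yum update -y docker-1.12.6-55.gitc4618fb.el7
systemctl restart docker atomic-openshift-node

# On the master: remove the existing logging project (step 6)
oc delete project logging

# On the ansible control host: redeploy the logging stack (step 7);
# playbook path assumes openshift-ansible installed from RPM
ansible-playbook -i <inventory> \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml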
Links:
inventory used to deploy logging: https://gist.github.com/wabouhamad/c96f9b243b4f03261295a133cd6feb5c
deployment config for elasticsearch: https://gist.github.com/wabouhamad/2b837b6046c2356b75baf604c16433af
daemonset config for logging-fluentd: https://gist.github.com/wabouhamad/66d7845a2cdf18242097c2fd853477c6
Ansible playbook to tune journald: https://gist.github.com/wabouhamad/e0c2cb8937ecbf15870b34f3d669a8f3

Actual results:
The fluentd pod logs were filled with long lines of backslashes and errors:

\\\"message\\\"=>\\\"2017-09-13 16:13:25 +0000 [warn]: dump an error event: error_class=NoMethodError error=\\\\\\\"undefined method `utc'

Expected results:
No errors in the logs, and the pod logs generated by the logtest tool indexed in Elasticsearch.

Additional info:
Journald was tuned with this playbook: https://gist.github.com/wabouhamad/e0c2cb8937ecbf15870b34f3d669a8f3
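The NoMethodError in the excerpt above comes from calling utc on the record timestamp, which with the json-file log driver arrives as an ISO8601 string rather than a Time object. As a minimal illustration outside of fluentd (assuming a host with a ruby interpreter; these one-liners are illustrative and not part of the original report):

# A String does not respond to utc, reproducing the error seen in the fluentd logs
ruby -e '"2017-09-13T16:13:19.521646195Z".utc'
# raises NoMethodError: undefined method `utc' for "2017-09-13T16:13:19.521646195Z":String

# Parsing the string into a Time object first succeeds
ruby -rtime -e 'puts Time.parse("2017-09-13T16:13:19.521646195Z").utc'
# prints: 2017-09-13 16:13:19 UTC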
fixed in https://github.com/openshift/origin-aggregated-logging/pull/657
Fixed for 3.6.2:

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=596941
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-10286-20170914205115
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.33
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.33-2
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
Verified on an AWS EC2 OCP cluster:

# openshift version
openshift v3.6.173.0.30
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

openshift_logging_image_version=v3.6.173.0.32
logging-fluentd image v3.6.173.0.33-2

Ran the logging test at 750 messages per second from a single pod and verified that messages were indexed successfully in Elasticsearch. No errors in the fluentd or elasticsearch pod logs, and no rejected bulk queue errors during the test.

# oc exec $POD -n logging -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/indices?v
health status index                                                                      pri rep docs.count docs.deleted store.size pri.store.size
green  open   .searchguard.logging-es-data-master-ok4s3o6y                                 1   0          5            0     30.4kb         30.4kb
green  open   project.logging.0d3b9ba5-9a7a-11e7-b958-020518006434.2017.09.16              1   0        949            0    420.4kb        420.4kb
green  open   project.lograte-750mpsec-0.33590427-9a89-11e7-b1fa-020518006434.2017.09.16   1   0     337125            0    137.9mb        137.9mb
green  open   project.lograte-100mpsec-0.eb21deed-9a87-11e7-b1fa-020518006434.2017.09.16   1   0      28000            0     11.5mb         11.5mb
green  open   .kibana                                                                      1   0          1            0      3.1kb          3.1kb
green  open   .operations.2017.09.16                                                       1   0          4            0     17.6kb         17.6kb

# oc exec -n logging $POD -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize
host       bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
172.20.0.7            917             0          0           0            200
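As a supplementary check while the logtest pod is running, the document count of the test index can be polled directly from the Elasticsearch pod. The _count endpoint is standard Elasticsearch; this exact invocation is illustrative and was not part of the verification above ($POD is an elasticsearch pod, as in the commands above).

# Poll the document count for the 750 msgs/sec test project index
oc exec -n logging $POD -- curl -s -k \
    --cert /etc/elasticsearch/secret/admin-cert \
    --key /etc/elasticsearch/secret/admin-key \
    "https://localhost:9200/project.lograte-750mpsec-0.*/_count?pretty"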
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188