At times, logging from a particular namespace appears to stop in the Kibana view of logs. This can occur when a namespace's pods running on one or more nodes in the cluster are not being indexed into Elasticsearch by that node's fluentd process. The fluentd process can get into this state when every attempt to write logs to an Elasticsearch instance takes longer than 5 seconds to complete. Until a consistent number of writes completes in under 5 seconds, logging essentially stops working. This can lead to log loss: as containers come and go, fluentd fills up its internal queue and is then unable to read logs from the new containers. The fix requires changing the "request_timeout" parameter of the elasticsearch output plugin to a very high value (600 seconds, or 10 minutes, should be sufficient for most purposes), and setting "buffer_queue_full_action" to "block" to prevent further records from being read from log files and subsequently dropped on the floor. Further, a "flush_interval" of 1 second ensures writes flow with minimal delay to Elasticsearch, keeping the size of the writes from each node in the cluster smaller and allowing for more overlap between log file reading and Elasticsearch writes.
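A minimal sketch of what such an elasticsearch output section might look like; the match pattern, host, and port are placeholders and not taken from the actual origin-aggregated-logging configuration:

```
# Hypothetical fluentd output config illustrating the settings above.
<match **>
  @type elasticsearch
  host elasticsearch.example.com   # placeholder
  port 9200                        # placeholder
  # Allow slow bulk writes to complete instead of timing out early
  request_timeout 600s
  # Block log-file reads when the buffer queue is full rather than
  # dropping records
  buffer_queue_full_action block
  # Flush every second so writes stay small and overlap with reading
  flush_interval 1s
</match>
```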
See also PR https://github.com/openshift/origin-aggregated-logging/pull/698 for a proposed set of changes upstream.
Closing in favor of the referenced trello card
Why close this bug? Don't we need a BZ to file this against 3.6.z?
Created 6 projects to populate logs and let them run for 3 hours; no exceptions were found, and project logs could be found in the Kibana UI.

# openshift version
openshift v3.6.173.0.49
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Images:
logging-auth-proxy-v3.6.173.0.49-4
logging-curator-v3.6.173.0.49-4
logging-elasticsearch-v3.6.173.0.49-5
logging-fluentd-v3.6.173.0.49-4
logging-kibana-v3.6.173.0.49-5

# rpm -qa | grep openshift-ansible
openshift-ansible-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-roles-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-docs-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-lookup-plugins-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-callback-plugins-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-playbooks-3.6.173.0.48-1.git.0.1609d30.el7.noarch
openshift-ansible-filter-plugins-3.6.173.0.48-1.git.0.1609d30.el7.noarch
Performance testing also passed, with no log loss.
QE was able to run 750-1000 1K messages/sec through fluentd with no message loss using the default settings. 750-1000 1K messages per second per fluentd is the current maximum we've found in testing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3389