Created attachment 1352321 [details]
ES and fluentd logs from anomalous runs

Description of problem:

Somewhere between logging-fluentd 3.7.0-0.158.0 (the OCP 3.7 scale lab version) and 3.7.7 (the SVT regression test version) there has been a performance regression in logging-fluentd in a non-mux configuration. I have not tried to bisect it down to the exact build. The fluentd CPU % numbers below are out of 1 core.

3.7.0-0.158.0:
250 1K messages/second per fluentd: fluentd CPU avg = 24%, message loss = 0%
500 1K messages/second per fluentd: fluentd CPU avg = 46%, message loss = 3%
750 1K messages/second per fluentd: fluentd CPU avg = 68%, message loss = 5%
No errors in the logging-fluentd pod logs.

3.7.7:
250 1K messages/second per fluentd: fluentd CPU avg = 38%, message loss = 4%
500 1K messages/second per fluentd: fluentd CPU avg = 73%, message loss = 4%
750 1K messages/second per fluentd: fluentd CPU avg = 88%, message loss = 13%
No error messages in the logging-fluentd log and no retry messages.

ES and fluentd logs will be attached for the runs with higher CPU/higher message loss. The test was repeated twice on each build with very consistent results. The message loss is possibly related to the re-opened bz https://bugzilla.redhat.com/show_bug.cgi?id=1486473, which also involved message loss.

Version-Release number of selected component (if applicable):
logging-fluentd 3.7.7

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging-fluentd 3.7.0-0.158.0
2. Run a message rate of 500 messages per second through the logging-fluentd node. Record fluentd CPU usage with pidstat -C fluentd 30 and average the samples.
3. Edit the logging-fluentd daemonset, update the logging-fluentd image to 3.7.7, and restart the fluentd pods.
4. Re-run the test.

Actual results:
29% - 58% higher CPU for the 3.7.7 build, plus higher message loss.
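The averaging in step 2 can be scripted. A minimal sketch, assuming the default sysstat pidstat column layout; the sample lines below are illustrative values, not data from this run:

```python
# Average the %CPU column from `pidstat -C fluentd 30` output.
# Assumes the default sysstat layout with a 12-hour clock:
#   Time AM/PM UID PID %usr %system %guest %CPU CPU Command
def avg_cpu(pidstat_output: str) -> float:
    samples = []
    for line in pidstat_output.splitlines():
        fields = line.split()
        # data rows end with the command name; %CPU is 3rd from the end
        if fields and fields[-1] == "fluentd":
            samples.append(float(fields[-3]))
    return sum(samples) / len(samples)

sample = """\
05:10:01 PM  0  1234  40.0  6.0  0.0  46.0  1  fluentd
05:10:31 PM  0  1234  42.0  6.0  0.0  48.0  1  fluentd
05:11:01 PM  0  1234  38.0  6.0  0.0  44.0  1  fluentd
"""
print(avg_cpu(sample))  # 46.0
```

The exact column positions vary between sysstat versions, so check the header row of your pidstat output before relying on the field index.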
The message loss itself seems to be a regression from even earlier builds: experiments in 3.6.z and early 3.7 showed no message loss at single-fluentd rates in excess of 750 1K messages per second per fluentd.

Expected results:
No message loss. CPU utilization that allows 750 - 1000 1K messages per second per fluentd. No performance regression.

Additional info:
OOTB memory and CPU requests/limits in effect.

[OSEv3:children]
masters
etcd

[masters]
ec2-54-190-57-184.us-west-2.compute.amazonaws.com

[etcd]
ec2-54-190-57-184.us-west-2.compute.amazonaws.com

[OSEv3:vars]
openshift_deployment_type=openshift-enterprise
openshift_release=v3.7.0
openshift_logging_install_logging=true
openshift_logging_master_url=https://ec2-54-190-57-184.us-west-2.compute.amazonaws.com:8443
openshift_logging_master_public_url=https://ec2-54-190-57-184.us-west-2.compute.amazonaws.com:8443
openshift_logging_kibana_hostname=kibana.1107-6c3.qe.rhcloud.com
openshift_logging_namespace=logging
openshift_logging_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_logging_image_version=v3.7.0-0.158.0
openshift_logging_es_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_size=50Gi
openshift_logging_fluentd_use_journal=false
openshift_logging_use_mux=false
openshift_logging_use_ops=false
What do these say?

oc exec -c elasticsearch $espod -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v

oc exec -c elasticsearch $espod -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool?v
root@ip-172-31-52-115: ~ # oc exec -c elasticsearch $espod -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v
health status index                                                           pri rep docs.count docs.deleted store.size pri.store.size
green  open   project.logging.258aabdb-c87a-11e7-9459-0210c4607772.2017.11.15   1   0       2388            0        1mb            1mb
green  open   .operations.2017.11.15                                            1   0      17027            0     10.9mb         10.9mb
green  open   .searchguard.logging-es-data-master-t1b082or                      1   0          5            0     33.5kb         33.5kb
green  open   project.mff0.24aa2e93-c9a6-11e7-9459-0210c4607772.2017.11.15      1   0     595099            0    240.5mb        240.5mb
green  open   .kibana                                                           1   0          1            0      3.1kb          3.1kb
green  open   .operations.2017.11.14                                            1   0      59500            0     25.3mb         25.3mb

project.mff0 should have 675000 documents.

root@ip-172-31-52-115: ~ # oc exec -c elasticsearch $espod -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool?v
host        ip          bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
172.23.0.46 172.23.0.46           0          0             0            0           0              0             0            0               0

bulk.rejected never has a value, and bulk.active/bulk.queue never go above 1 during periodic sampling.
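A quick way to quantify the gap is to parse docs.count out of the _cat/indices output and compare it to the expected message count. A sketch; docs_count() is my own helper, not part of any tooling used in this bz:

```python
# Compare docs.count from `_cat/indices?v` output against the number of
# messages the logtest pod actually logged.
EXPECTED = 675_000

def docs_count(cat_indices: str, index_prefix: str) -> int:
    for line in cat_indices.splitlines():
        fields = line.split()
        # columns: health status index pri rep docs.count docs.deleted ...
        if len(fields) >= 6 and fields[2].startswith(index_prefix):
            return int(fields[5])
    raise ValueError("index not found")

cat_output = """\
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open project.mff0.24aa2e93-c9a6-11e7-9459-0210c4607772.2017.11.15 1 0 595099 0 240.5mb 240.5mb
"""
print(EXPECTED - docs_count(cat_output, "project.mff0"))  # 79901 missing
```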
There are three main changes:

1) docker audit logs

This should only be enabled if you set env AUDIT_CONTAINER_ENGINE=true.

oc exec $fluentd_pod -- ls -alrtF /etc/fluent/configs.d/openshift

If the files input-pre-audit-log.conf and filter-pre-a-audit-exclude.conf are present and not 0 length, then that is a bug. If this is disabled, I would not expect much CPU impact or message loss.
2) configs.d/openshift/filter-pre-force-utf8.conf

+<filter **>
+  @type record_modifier
+  char_encoding utf-8
+</filter>

This affects every field of every record. This may contribute to CPU usage, but I would not expect record loss.
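To see why this filter is expensive, here is a rough Python analogue of the work it does per record; this is an illustration of the idea, not the record_modifier plugin's actual Ruby code:

```python
# Rough analogue of the char_encoding utf-8 filter: re-encode every string
# value in the record, replacing bytes that are not valid UTF-8. The cost
# scales with total message bytes, since every record passes through it.
def force_utf8(record: dict) -> dict:
    return {
        k: v.encode("utf-8", errors="replace").decode("utf-8")
        if isinstance(v, str) else v
        for k, v in record.items()
    }

# A lone surrogate stands in for undecodable input; it gets scrubbed to "?"
print(force_utf8({"message": "bad\udcffbyte", "pid": 42}))
# {'message': 'bad?byte', 'pid': 42}
```

Note the scrubbing only rewrites record contents; it does not drop records, which matches the expectation above that this change costs CPU but not messages.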
3) the k8s meta cache and watches tuning

 <filter kubernetes.**>
   type kubernetes_metadata
   kubernetes_url "#{ENV['K8S_HOST_URL']}"
+  cache_size "#{ENV['K8S_METADATA_CACHE_SIZE'] || '1000'}"
+  watch "#{ENV['K8S_METADATA_WATCH'] || 'false'}"

If a lot of pods/namespaces are used during the test, this may have a significant CPU impact and may lead to message "loss" (but you should see the message "missing kubernetes field - cannot construct index name", and the records should go into the fluentd index or possibly be rerouted to some other index).
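To illustrate the cache_size concern, a hedged sketch of how metadata lookups behave behind an LRU cache; the names and counts are hypothetical, and this is not the kubernetes_metadata plugin's code:

```python
# Why cache_size matters: each log line needs pod/namespace metadata,
# lookups miss once per pod (more often if the cache evicts entries),
# and every miss is a round trip to the API server.
from functools import lru_cache

api_calls = 0

@lru_cache(maxsize=1000)          # mirrors the K8S_METADATA_CACHE_SIZE default
def pod_metadata(namespace: str, pod: str) -> dict:
    global api_calls
    api_calls += 1                # stands in for a GET to the API server
    return {"namespace": namespace, "pod": pod}

for i in range(5000):             # 5000 log lines from 50 distinct pods
    pod_metadata("test-ns", "pod-%d" % (i % 50))
print(api_calls)  # 50 -- one call per pod; the rest hit the cache
```

With more distinct pods than cache slots, evictions would force repeated API calls, which is where the CPU impact mentioned above could come from.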
Are you sure that those missing records didn't end up in the .operations.* indices? If they did not, and there are no fluentd errors, I'm at a loss to explain what happened to the missing records.
rhpkg co logging-fluentd-docker
cd logging-fluentd-docker
git diff ba68308a0e45eac247e7b70be6b7c9673c431b80..aa03149eab5388827ee82b7c757fc43a9edbab98
Perhaps the records were buffered and not sent to elasticsearch?

ls -alrtF /var/lib/fluentd

to list the buffers.

There is a tool called msgpack-inspect which will dump those buffers:

gem install msgpack-inspect
msgpack-inspect /var/lib/fluentd/some-file.log
(In reply to Rich Megginson from comment #8)
> There is a tool called msgpack-inspect which will dump those buffers:
>
> gem install msgpack-inspect
>
> msgpack-inspect /var/lib/fluentd/some-file.log

It is probably easier to do this inside the fluentd pod, if it is still running.
In response to comment 6 and comment 8:

The test was repeated. 675,000 messages were logged from the logtest pod, but only 612,872 documents made it into Elasticsearch for the project index "project.walid-0"; 62,128 documents are missing. There were no errors in the fluentd or elasticsearch pod logs.

The .operations.<today's date> index was deleted before the test. After the test, there were only 4291 documents in .operations.2017.11.15.

root@ip-172-31-52-115: # oc exec -n logging $espod -c elasticsearch -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/_cat/indices?v
health status index                                                            pri rep docs.count docs.deleted store.size pri.store.size
green  open   project.logging.258aabdb-c87a-11e7-9459-0210c4607772.2017.11.15    1   0      12628            0      3.6mb          3.6mb
green  open   .operations.2017.11.15                                             1   0       4291            0      2.1mb          2.1mb
green  open   .searchguard.logging-es-data-master-t1b082or                       1   0          5            0     33.5kb         33.5kb
green  open   .kibana                                                            1   0          1            0      3.1kb          3.1kb
green  open   .operations.2017.11.14                                             1   0      59500            0     25.3mb         25.3mb
green  open   project.walid-0.fd037d94-ca28-11e7-9459-0210c4607772.2017.11.15    1   0     612872            0    248.1mb        248.1mb
green  open   project.mff0.24aa2e93-c9a6-11e7-9459-0210c4607772.2017.11.15       1   0     595099            0    240.5mb        240.5mb
root@ip-172-31-52-115: ~/svt/openshift_scalability #

In the fluentd pod, in /var/lib/fluentd, there were two transient buffer files at any given time during the logging test. No buffer files were left in that directory after the test completed.

Sample transient files:

total 3892
drwxr-xr-x. 1 root root      35 Nov 15 01:39 ../
drwxr-xr-x. 2 root root     134 Nov 15 17:33 ./
-rw-r--r--. 1 root root 1952309 Nov 15 17:33 buffer-output-es-config.output_tag.q55e08e7c50b99100.log
-rw-r--r--. 1 root root 1954255 Nov 15 17:33 buffer-output-es-config.output_tag.b55e08e7da298a677.log

bulk.rejected was 0:

oc exec -c elasticsearch $espod -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool?v
host        ip          bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
172.23.0.46 172.23.0.46
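The counts reported in this comment can be checked directly; the arithmetic shows that even if every document in the fresh .operations index were a misrouted test message, most of the missing records would still be unaccounted for:

```python
# Arithmetic on the counts from this run.
logged, indexed = 675_000, 612_872
missing = logged - indexed
print(missing)                    # 62128
ops_docs = 4_291                  # .operations.2017.11.15 after the test
print(missing - ops_docs)         # 57837 still unexplained
```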
For the performance regression, the primary culprit seems to be the utf8 record modifier: with it in the pipeline, fluentd is twice as slow.
This addresses only the performance part, and may not cover all of the performance issues. It does not address the record loss issue.
Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/6b74f2f9c16dec7ee90681ee57234f696d2b553a
Bug 1513254 - logging-fluentd performance regression between 3.7.0-0.158.0 and 3.7.7 + message loss.
https://bugzilla.redhat.com/show_bug.cgi?id=1513254
The utf8 filter causes the performance to be cut in half. Disable it by default. If you need to use it, and can pay the performance penalty:
oc set env ds/logging-fluentd ENABLE_UTF8_FILTER=true

https://github.com/openshift/origin-aggregated-logging/commit/9f87406267ce7e7b2eca6b00d90c9996dc65267c
Merge pull request #787 from richm/bug-1513254
Automatic merge from submit-queue.
Bug 1513254 - logging-fluentd performance regression between 3.7.0-0.158.0 and 3.7.7 + message loss.
see also https://github.com/openshift/origin-aggregated-logging/pull/622
The latest v3.7 logging-fluentd image does not contain the fix. I checked run.sh, and the code in https://github.com/openshift/origin-aggregated-logging/pull/806/files is not present. Moving this defect back to MODIFIED until a build with the fix is available.

We're testing with 3.7.14-2:

registry.reg-aws.openshift.com:443/openshift3/logging-fluentd   v3.7.14-2   772622353a46   5 days ago   235.5 MB

"5 days ago" is as of 13 December. Since PR #806 merged on 30 November, maybe it is a build issue?
Retested with logging-fluentd:v3.7.16-1, with the same out-of-the-box memory and CPU requests/limits for logging-fluentd and ES, and the same docker json-file log driver configuration of max-file=3 and max-size=10M. The utf8-filter should be disabled by default. I confirmed that the run.sh script in the fluentd pods contains the code update from PR https://github.com/openshift/origin-aggregated-logging/pull/787/files, and that

oc exec $fluentd_pod -- ls -alrtF /etc/fluent/configs.d/openshift

shows a file size of 0 for filter-pre-force-utf8.conf.

When the same logging tests at 250 and 500 messages per second were repeated, message loss in Elasticsearch was still observed, and at the same percentage: no improvement in ES message loss compared to the previously reported results with the utf8-filter enabled. No errors were observed in the fluentd or ES pod logs. ES bulk.rejected was always 0, and bulk.active/bulk.queue never went above 1 during the test.

The only improvement observed was in fluentd CPU utilization on the compute nodes:

- 250 1 KB messages per second: 666451 documents were indexed in ES for the pod's project out of the 675000 messages logged; 8549 messages were lost (1.3%). fluentd CPU avg = 25%. In the previous tests with the utf8-filter enabled, 8891 messages were lost (1.4%) and the fluentd CPU avg was 34% of 1 core.

- 500 1 KB messages per second: 657909 documents were indexed in ES for the pod's project out of the 675000 messages logged; 17091 messages were lost (2.5%). fluentd CPU avg = 49%. In the previous tests with the utf8-filter enabled, 17774 messages were lost (2.6%) and the fluentd CPU avg was 76% of 1 core.

Another interesting observation: when the tests were repeated, exactly the same number of lost messages was observed on the subsequent runs.
Finally, when I increased the fluentd daemonset config variables BUFFER_QUEUE_LIMIT from 32 to 1024, BUFFER_SIZE_LIMIT from 8m to 16m, and FILE_BUFFER_LIMIT from 256Mi to 1Gi, I observed NO message loss in ES: all 675,000 documents were indexed at the 500 messages per second logging rate with the utf8-filter disabled. This suggests the message loss in ES is mainly related to fluentd buffering capacity/performance.

Moving back to the Assigned state.
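A back-of-the-envelope check (my own arithmetic, not numbers from this bz) of why those buffer settings matter: the default queue capacity is well below one test's worth of payload, while the tuned capacity is far above it.

```python
# Queue capacity = BUFFER_QUEUE_LIMIT x BUFFER_SIZE_LIMIT, compared to the
# raw payload of one test run (ignoring per-record framing overhead).
MiB = 1024 * 1024
default_cap = 32 * 8 * MiB        # 32 chunks x 8m
tuned_cap   = 1024 * 16 * MiB     # 1024 chunks x 16m
test_bytes  = 675_000 * 1024      # 675,000 1 KiB messages
print(default_cap // MiB)         # 256   (MiB)
print(tuned_cap // MiB)           # 16384 (MiB)
print(test_bytes // MiB)          # 659   (MiB of raw payload)
```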
So we still have a regression from the baseline 3.7 release. The fix in https://bugzilla.redhat.com/show_bug.cgi?id=1513254#c20 wasn't meant to fix all of the problems reported in this bz, but it appears to have helped.

We have "fixed" the utf8 filter (https://bugzilla.redhat.com/show_bug.cgi?id=1513254#c4), and I'm pretty sure it isn't related to the audit log (https://bugzilla.redhat.com/show_bug.cgi?id=1513254#c3). That leaves the following:

- the k8s filter changes: https://bugzilla.redhat.com/show_bug.cgi?id=1513254#c5
- something else - not sure what that would be - perhaps changes in fluentd itself, some dependency, etc.
We are now consistently seeing NO message loss in Elasticsearch at 500 and even 750 messages per second logging rates with the following settings:

- utf8-filter disabled (the default)
- out-of-the-box logging settings
- docker on the compute nodes configured with --log-opt max-size=100M (instead of 10M) for the json-file log driver, keeping the same max-file=3

This was confirmed on two separate OCP 3.7 clusters (openshift versions 3.7.14-1 and 3.7.18-1) running logging images v3.7.16-1. In each test, 1,500,000 1K messages were logged from a logtest pod at a rate of 750 messages per second. No loss was observed. fluentd CPU utilization averaged around 72% when logging at 750 messages per second.

It is clear that the size of the json-file log files has a significant impact on ES message loss when pods are logging at high rates of 300 1K messages per second or higher. This was also mentioned in private comment 16.
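Rough rotation-interval arithmetic supports this (my estimate; JSON framing overhead and docker's exact size parsing are ignored): at these rates a small json-file fills in seconds, so log lines can be rotated away before fluentd's in_tail catches up.

```python
# How long a json-file log takes to fill at 750 x 1 KiB messages/second,
# for the two max-size settings compared in this comment.
rate_bytes = 750 * 1024                    # ~750 KiB/s of raw payload
for max_size_mb in (10, 100):              # --log-opt max-size values
    seconds = max_size_mb * 1_000_000 / rate_bytes
    print("max-size=%dM fills in about %ds" % (max_size_mb, seconds))
```

At the 10M default the file fills in roughly 13 seconds, versus a couple of minutes at 100M, which is consistent with the larger setting eliminating the loss.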
@Mike @Walid - what do we need to do to close this bug?
I think we should do one more test pass on the latest 3.7 errata and document the current performance. QE SVT will queue that up. You can move the bz to ON_QA if there are no further planned code changes targeted for this bz.
Disabling the utf8 filter (by default) addressed this issue. Verified on 3.7.44