Bug 1833486
Summary: Logging performance degraded compared to 4.4

Product: OpenShift Container Platform
Component: Logging
Version: 4.5
Target Release: 4.6.0
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Keywords: Regression, Reopened
Reporter: Eric Matysek <ematysek>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Eric Matysek <ematysek>
CC: anli, aos-bugs, dblack, jcantril, lvlcek, mifiedle, mkarg, mrobson, periklis, yjoseph
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Cloned To: 1877818
Bug Blocks: 1847027, 1877818
Type: Bug
Last Closed: 2020-10-27 15:58:59 UTC
Description: Eric Matysek, 2020-05-08 17:58:57 UTC
Please post Elasticsearch and fluentd logs for review.

Created attachment 1687396 [details]
Elasticsearch and Fluentd Logs
Attached logs as requested.
I generated 600k logs for this cluster at 500/s and only got 370k successfully indexed.
$ python verify_logtest_index.py -i 'app*' --stream -m 600000
Index document count: 370000
Missing log line(s): 1-230000 (230000)
No duplicates found!
Number of missing logs: 230000
38.3333% message loss rate
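For context, the idea behind a check like this is straightforward: each generated test message carries a sequence number, so missing ranges can be computed from what Elasticsearch actually indexed. Below is a minimal Python sketch of that approach; it is not the actual verify_logtest_index.py, and the "seq" field name and connection details are assumptions.

# Sketch: find gaps in an expected 1..N sequence of test log lines in ES.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

def missing_lines(es, index, expected):
    seen = set()
    # Scroll through every document in the index and record its sequence number.
    for doc in scan(es, index=index, query={"query": {"match_all": {}}}):
        seen.add(int(doc["_source"]["seq"]))  # "seq" is a hypothetical field name
    return [n for n in range(1, expected + 1) if n not in seen]

es = Elasticsearch("https://localhost:9200")  # cert/auth details omitted
gaps = missing_lines(es, "app*", 600000)
print(f"Number of missing logs: {len(gaps)}")
print(f"{100 * len(gaps) / 600000:.4f}% message loss rate")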
Maybe caused by https://bugzilla.redhat.com/show_bug.cgi?id=1834558

The fix for bug 1834558 did significantly improve performance; however, I am still seeing a decrease in throughput compared to 4.4 for single-pod logging rates. On 4.4 we were able to reliably log 2.5k msg/s, and even saw some passes at 3k msg/s. On 4.5, 2k msg/s seems reliable, but I have not been able to get a single pass at 2.5k msg/s.

I'm having difficulty standing up a cluster to verify, but please try the following to bypass the proxy to Elasticsearch so we can determine how it is affecting performance:
1. oc edit clusterlogging instance (set ManagementState: Unmanaged)
2. oc edit elasticsearch elasticsearch (set ManagementState: Unmanaged)
3. oc edit configmap elasticsearch (modify the network settings to match https://github.com/openshift/elasticsearch-operator/blob/release-4.2/pkg/k8shandler/configuration_tmpl.go#L17-L18)
4. Delete the Elasticsearch pods to force them to restart and load the new config.
5. oc edit service elasticsearch (modify "targetPort" to be 9200 instead of 'restapi')
Fluentd should already be able to write to ES using its certificates, and this should modify the service to direct traffic straight to the ES container, bypassing the proxy.

@ematysek Have you tried the above comments from https://bugzilla.redhat.com/show_bug.cgi?id=1833486#c5 ?

Unfortunately I wasn't able to get logs successfully indexed into Elasticsearch after making the changes mentioned by Jeff.

Created attachment 1692476 [details]
Fluentd Logs
Fluentd logs after making the changes to try to bypass the proxy.
Mostly filled with lines like this:
2020-05-26 19:13:45 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=2 next_retry_seconds=2020-05-26 19:13:47 +0000 chunk="5a690759010aca78709f913e131436a9" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
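For intuition on why a persistent "buffer is full" state is worrying: if a failed flush re-emits its chunk into a bounded buffer that producers have already refilled, the output can wedge permanently (the concern raised below about fluent-plugin-elasticsearch PR 688). A conceptual Python sketch of that failure mode, not fluentd's actual code:

import queue

buf = queue.Queue(maxsize=2)   # bounded buffer, standing in for fluentd's chunk buffer
buf.put("chunk-a")
buf.put("chunk-b")             # the buffer is now full

def send_to_es(chunk):
    return False               # simulate ES persistently rejecting the bulk request

def flush_one():
    chunk = buf.get()          # take one chunk to flush (frees one slot)
    if not send_to_es(chunk):
        # If a producer refills the freed slot before this put(), the retry
        # blocks forever and no further chunks are ever flushed.
        buf.put(chunk)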
Lowering severity to low, as this is not a functional blocker for 4.5.

(In reply to Eric Matysek from comment #8)
> Created attachment 1692476 [details]
> Fluentd Logs
>
> Fluentd logs after making the changes to try to bypass the proxy.
> Mostly filled with lines like this:
> 2020-05-26 19:13:45 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=2 next_retry_seconds=2020-05-26 19:13:47 +0000 chunk="5a690759010aca78709f913e131436a9" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."

Not certain this is related to the perf issue, but it definitely seems like it would block forever:
https://github.com/uken/fluent-plugin-elasticsearch/pull/688

(In reply to Jeff Cantrill from comment #10)
> Not certain this is related to the perf issue, but it definitely seems like it would block forever:
> https://github.com/uken/fluent-plugin-elasticsearch/pull/688

We have 4.0.5, and this was merged earlier, so it is likely not blocking.

I would be interested in learning how many primary shards and replicas we have for each index in ES in this case. Can you please get the indices status? You can use the "indices" utility:
https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/utils/indices
Given that we should have significantly fewer indices now, we need to check if we can get better performance by using more shards.

I found a bug. We did not update the index name patterns in the common.settings.* files in https://github.com/openshift/origin-aggregated-logging/tree/master/elasticsearch/index_templates
All of them still assume the old index naming conventions ("project*", ".operations*", etc.). These files are not part of https://github.com/ViaQ/elasticsearch-templates so they need to be updated separately.

That should mean that indices use the default Elasticsearch sharding (https://www.elastic.co/guide/en/elasticsearch/reference/6.8/indices-create-index.html#create-index-settings), which is 5 shards and 1 replica. The indices status report should confirm this.

Updating to target 4.6 to get these into the release and backported to 4.5.

Requested backport in GH.

(In reply to Lukas Vlcek from comment #13)
> That should mean that indices use the default Elasticsearch sharding, which
> is 5 shards and 1 replica. The indices status report should confirm this.

You can see in my original comment that the app* indices had 3 primary shards and 1 replica. I think if the bottleneck were Elasticsearch itself, I would be seeing some sort of anomaly in Elasticsearch pod CPU/memory usage.

This is a regression and is a loss of pod log messages. Tagging as regression. I think this needs to be targeted for 4.5?
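To make the shard question above concrete, here is a hedged Python sketch of confirming per-index shard and replica counts via the _cat/indices API. The endpoint and certificate paths are assumptions (the linked "indices" utility runs a similar query from inside the ES pod):

import requests

ES = "https://localhost:9200"   # e.g. after `oc port-forward` to the ES pod
resp = requests.get(
    ES + "/_cat/indices?h=index,pri,rep,docs.count&format=json",
    cert=("./admin-cert", "./admin-key"),  # placeholder client-cert paths
    verify="./admin-ca",
)
for idx in resp.json():
    # ES 6.x defaults to pri=5/rep=1; the app indices in this bug show pri=3/rep=1.
    print(idx["index"], "pri=" + idx["pri"], "rep=" + idx["rep"], "docs=" + idx["docs.count"])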
Here is an example of random chunks of messages being lost, to differentiate between this bug and bug #1846174.

Test setup: 1 project, 2.4M lines, 2k/s rate

$ python verify_logtest_index.py --stream -i app* -m 2400000
Index document count: 330000
Missing log line(s): 273001-1799000 (1526000)
Missing log line(s): 1856001-2400000 (544000)
No duplicates found!
Number of missing logs: 2070000
86.2500% message loss rate

Put UpcomingSprint, as investigation is ongoing according to [1]; the PR does not seem conclusive yet.
[1] https://coreos.slack.com/archives/CB3HXM2QK/p1593547317176100

*** Bug 1846174 has been marked as a duplicate of this bug. ***

*** Bug 1844639 has been marked as a duplicate of this bug. ***

Rebuilt the 4.4 image to be 4.5 compatible (e.g. hostname, updated viaq plugin, wait_for_es_script) and reliably ran "WORKLOAD_DIR=$(pwd) NUM_LINES=1125000 RATE=150000 ansible-playbook -v -i inventory workloads/logging.yml" three times in succession without failure. Switched to the 4.5 image and ran the same test, which passed the first time and failed on 2 subsequent runs.

Unfortunately this bug still exists for me. I've also tried deploying CLO/EO 4.4 on a 4.5 OCP cluster and am able to log at the expected rates, so I think that observation confirms the regression was introduced in CLO/EO rather than something in OCP.

Confirmed PR 1953 fixes this issue and we are able to log at 2.5k/s consistently.

I can confirm that logging at 2.5k/s works fine on 4.5.5, but going for higher rates leaves me with messages stuck in the fluentd pods that never get delivered to ES. I reran a test with 100 pods in 100 namespaces, logging at 50000 msgs/sec in total for 20 minutes, which should result in 600000 messages in every index. This is what I got:

green open project.logtest98.fdfae9ef-a84a-4c7e-8f70-e580ddf60159.2020.08.13 f3roeu5rR1utKqlASrfTlQ 3 1 451750 0 617.1mb 308.6mb
green open project.logtest83.bfbfa57a-2794-4450-90f2-7f9d283c5365.2020.08.13 Lozpgwd9SLmfImn4OQW9fg 3 1 103500 0 143.3mb 72.2mb
green open project.logtest86.dcc4f112-0328-49e4-b177-97fb2527b5d8.2020.08.13 POLDYc3HTXyBB30PFZ4H8g 3 1 203000 0 277.4mb 138mb
green open project.logtest61.cc3784c7-7702-4b64-84bc-bb5b52235677.2020.08.13 2-mxFKBpTdCoa1C_exjQSw 3 1 100250 0 139.4mb 69.8mb
green open project.logtest82.abc253a2-965d-4493-bc30-cfda648d0124.2020.08.13 0MXvzY-qTw6A201EdWoaCA 3 1 201500 0 255.4mb 127.6mb
green open project.logtest57.9111cd70-40fa-464d-8a9c-1303ee60e31b.2020.08.13 n7wWZum-S6-6EdpAjoPfzw 3 1 151750 0 208.1mb 103.8mb
green open project.logtest40.e924f170-4805-4a71-b10a-c17167d56a58.2020.08.13 n7f_RiqPQHO2O_WyALjLxQ 3 1 200750 0 273.7mb 137mb
green open project.logtest37.226c2a8f-e2bd-4154-bc07-7d07e0835389.2020.08.13 QRO2_AEzT_WXwZYfh-ElXw 3 1 252750 0 346mb 173mb
green open project.logtest5.c9521dac-510d-40a8-b7b6-94d74851d28a.2020.08.13 8jsCUniUSpu8s4JbXYGdVg 3 1 201500 0 277.4mb 138.5mb
green open project.logtest34.1c1679fb-ba08-460d-bc81-1103ffbac4d3.2020.08.13 Hqu7M6C9Q5GUu4EpsYs2_g 3 1 251500 0 344.3mb 171.4mb
green open project.logtest97.1f516add-6c98-4be1-9c87-3fd35aa1ec9b.2020.08.13 VmNbro9mTAWxBOz-N0PVog 3 1 151750 0 208.5mb 104.2mb
green open project.logtest28.0872992b-4ca1-43ba-84de-dea2b1284098.2020.08.13 w5543si2Rv-Z4yWE0IwKnw 3 1 152000 0 209.5mb 104.8mb
green open project.logtest71.f293aeec-656f-421d-be27-a806e68b8dea.2020.08.13 PnSpUG2yQWaBNflO1Dx-jA 3 1 352000 0 481.1mb 240.6mb
green open project.logtest70.2e02f50f-a5ea-4e77-89ad-d033370e2066.2020.08.13 k8DIiOaxQ7WDL9LaVZzGKw 3 1 100000 0 138.2mb 68.5mb
green open project.logtest8.017152b5-d477-4147-8518-f02358fcec21.2020.08.13 AzHHg0LPTAG_nWlYUhRMFg 3 1 16000 0 24.4mb 12.1mb
green open project.logtest75.0e3a3449-e98b-4994-959d-500553bff5bc.2020.08.13 LV-IwSTxSl-9vpGOAnWEqA 3 1 100000 0 137.5mb 68.5mb
green open project.logtest20.5da98cdb-7f58-40ad-a92d-59d6cb20a4d0.2020.08.13 qjOK8yd0RtinDDi9zGz44w 3 1 103500 0 142.7mb 71.6mb
green open project.logtest30.0099fef1-28b3-4902-9e1c-9a23aefd78bc.2020.08.13 0vYDJT90Q7eTg5wFUItXeg 3 1 204000 0 280.2mb 140.5mb
green open project.logtest84.e6e79c7f-142a-42c3-a353-f68a6372634b.2020.08.13 81ft2GteTh2i9WphdqlzJQ 3 1 202750 0 277.8mb 138.8mb
green open project.logtest63.2a8f44bb-2307-4399-8479-6796c77a8908.2020.08.13 1j9mFcZYRlqsQogrN5N0Ag 3 1 202500 0 278.3mb 138.6mb
green open project.logtest87.4019da71-7b97-47dd-b473-5baf49d90f21.2020.08.13 FfCJF2IvSU2BHW1ohTIPLg 3 1 200000 0 273.9mb 137.1mb
green open project.logtest22.14853e9d-dd11-4aef-ba4e-3c19395a0d95.2020.08.13 T8aBvfFfStOdkEvczUItpQ 3 1 203000 0 278.7mb 139.2mb
green open project.logtest58.230cf278-d81e-4957-981e-0f994dc24c22.2020.08.13 eAKnTToATtOiiF0mTXmRtg 3 1 103750 0 142.2mb 70.9mb
green open project.logtest6.12b8e87a-de40-4e5b-a85e-8acbe52519c9.2020.08.13 Gr2GwGXyRpOfxH_VPzvgIQ 3 1 16000 0 24.3mb 12.1mb
green open project.logtest91.4b837eda-5c21-4b38-8351-d797afcf2182.2020.08.13 L_K32ayrTJyOl89c8oKKMg 3 1 252500 0 346.5mb 172.9mb
green open project.logtest42.38ce51a4-88e2-4e88-a820-62b5b4275c1c.2020.08.13 7KaHvVMhT7G66jkVdPvHPA 3 1 203000 0 277mb 138.5mb
green open project.logtest67.95bf0ab6-ac87-4a38-8ffa-e637163c157d.2020.08.13 yFAgph3xR2CnwuKMgv91UQ 3 1 100750 0 139.3mb 69.7mb
green open project.logtest12.2eb97788-8aa8-4dfa-a5f7-6b4b9f463c8b.2020.08.13 RnEbfFOkRvWWLnD2cHEYSA 3 1 501000 0 684.3mb 342.2mb
green open project.logtest79.d4d778fe-78df-4d2a-a941-1c6f641c2875.2020.08.13 NiDbcNIdQAm_cWEyM2ldsQ 3 1 149750 0 207.6mb 103.9mb
green open project.logtest47.600e17ea-c88f-447e-a9f1-1a610c1fa5d4.2020.08.13 8dATJn1AT0e3WMVIWYkQQg 3 1 204000 0 279.8mb 139.8mb
green open project.logtest41.572529c5-6cdb-4389-bd2f-7c78c98b3ba9.2020.08.13 lMoEmd_KQaC6_mu0L03fLw 3 1 352750 0 483.4mb 241.6mb
green open project.logtest2.f60f1a2c-8c92-4eda-961c-284eff3c862c.2020.08.13 gdcmORREQD6LGcivPOqGIw 3 1 302000 0 413mb 206.2mb
green open project.logtest49.bd65fda1-fcda-4811-9921-dbf857c53687.2020.08.13 YhpsDLrSS9KKwQcj4AV9dw 3 1 202805 0 278.4mb 138.9mb
green open project.logtest32.bfec867f-d805-48cd-938f-a3b219027850.2020.08.13 P0ja2ulNRQKBri8jPuM04Q 3 1 101000 0 140.2mb 69.9mb
green open project.logtest38.fb0a2902-89b4-47f5-a447-e30323ddc4db.2020.08.13 jqwzyEBhQ_S3kpXJechkuA 3 1 252750 0 344.9mb 172.7mb
green open project.logtest62.b4cc0582-835e-4a6a-b355-e603590789fd.2020.08.13 mYjNpdrxRxeKdRlTOrE6vw 3 1 202500 0 276.7mb 138.2mb
green open project.logtest9.bd3ab32d-a801-4ecf-bb86-080bf26729f8.2020.08.13 YZ9dUEVVTR-TWO4V2fERrg 3 1 252250 0 345.5mb 172.5mb
green open project.logtest21.ef9394cb-6ed8-48f1-a0e0-8366c3705fee.2020.08.13 jscgX1nERCKkhxr0Kofjew 3 1 351750 0 482mb 240.9mb
green open project.logtest95.3bf7b5ca-7cf6-45a1-b529-476158cc20ea.2020.08.13 cGGkP_U9SIuf1JyeH2jiZA 3 1 252250 0 345.1mb 172.7mb
green open project.logtest1.721c4164-e3ec-42ca-a1ee-738abec432d6.2020.08.13 G-11jZRVQkqs270rRy7OmQ 3 1 600000 0 819mb 409.6mb
green open project.logtest4.3ee3b7b0-724c-4c23-bbcd-c3fc1bfe266a.2020.08.13 KGEtDaNCRQa9qCeawA0uDw 3 1 501000 0 685.5mb 342.8mb
green open project.logtest24.3e88b5a3-3754-49ae-9aa5-1c9f643c255c.2020.08.13 o_CWjoAITdOPis1m_vvQ9w 3 1 600000 0 817.5mb 408.6mb
green open project.logtest81.e19926d0-59bb-4850-b469-b775b9dd9661.2020.08.13 VR5A1qAIRoOPZx0eMdGOaw 3 1 256723 0 326.3mb 163.5mb
green open project.logtest60.f6a9c1ac-0302-496b-9d14-4de7ef78eeb9.2020.08.13 T-Mfo8uwTD6ctrhcxmo3PA 3 1 202630 0 278.3mb 138.5mb
green open project.logtest77.536c1a73-7c38-4825-94a1-f4ed32c65641.2020.08.13 QRRjY5YNTuSltgVM-lMTPg 3 1 102000 0 141.3mb 70.7mb
green open project.logtest11.c6482750-a9e8-4dbc-b3f4-1402ce1d401e.2020.08.13 4AiBAfsFTPalXf623vyTIQ 3 1 253750 0 347.5mb 173.7mb
green open project.logtest43.3cd984b9-c2da-4bbe-b513-7f821e818ec9.2020.08.13 6WUk-8IbQoCpgNYH0lRfJg 3 1 153750 0 211.4mb 105.8mb
green open project.logtest72.ba3aa48b-8358-4a4c-a512-717f236142cb.2020.08.13 wIwwNQ77Qd62RjVQWCWSMA 3 1 251750 0 318.3mb 159mb
green open project.logtest52.c88f7265-fa2d-4813-a36a-f8a2054a55f6.2020.08.13 Wfdnm6TyQG2_xjhjm0HqKA 3 1 203000 0 278.7mb 139.6mb
green open project.logtest78.bb3de4aa-9f42-4f58-911a-383aa3a47b39.2020.08.13 NvHzfFJ7TTGeq3-xKTu07Q 3 1 152000 0 194.3mb 97.2mb
green open project.logtest15.dda34629-0e1d-46ef-8c56-f7d7c4b9c781.2020.08.13 doiW1BI2R3moeky_I1Hw7g 3 1 252750 0 348.1mb 174.1mb
green open project.logtest46.38645aac-c60a-4175-95be-b0936e44b63f.2020.08.13 WGhDKeaCSIOLAMKcB8AoYw 3 1 402000 0 548.6mb 274.6mb
green open project.logtest55.fbfc4a80-f54f-4946-9897-c0089dda8520.2020.08.13 Gk7nV1vuRSeQPbFunOib_g 3 1 402000 0 550.3mb 275.2mb
green open project.logtest54.532a44d6-2841-45f3-91fc-d3ee96033d57.2020.08.13 d5ESPmkeTrufgS222NxY5Q 3 1 151580 0 208.9mb 104.5mb
green open project.logtest45.ee24e156-b682-482a-a99f-d5aa2b6447df.2020.08.13 NJE7qfr9QDKZ2-z17CnGuQ 3 1 100500 0 139.4mb 69.5mb
green open project.logtest89.347a235a-5682-4bca-8a29-b0de6b18c59d.2020.08.13 gSP26jjpTre5uMWbOT1QhQ 3 1 51000 0 71.2mb 35.7mb
green open project.logtest19.bc3dbbd3-e996-4b28-8a38-dbf2a6959929.2020.08.13 i7CpFQ93StGJV_ZNUXfc7Q 3 1 203250 0 279.6mb 139.6mb
green open project.logtest25.a47ec1ff-d1ab-4d85-af73-79e6876df681.2020.08.13 bR2ZqQMWQQqrVeSPwVk8hA 3 1 252750 0 345.8mb 173.2mb
green open project.logtest73.b0fa5917-6c01-48e6-997b-4b8fec79258d.2020.08.13 9MaOc2B2SFG5akTSzmLXJg 3 1 203000 0 278.2mb 139.3mb
green open project.logtest66.5ff3ca9b-f769-41a5-abbf-f649b4053d24.2020.08.13 AxDKp6PuRwaBG3JrC4RnhA 3 1 102000 0 141.1mb 71.1mb
green open project.logtest4.4a8cf193-b552-467f-bd39-42e15dea26a0.2020.08.13 h1VuZy2jSDWsiDxgeBTGBA 3 1 15500 0 23.9mb 11.9mb
green open project.logtest44.37095fb0-5272-4baf-8a43-94b5f01c5cbc.2020.08.13 opqhAbfDR3OtduE7fsq2FQ 3 1 402000 0 550.3mb 275.1mb
green open project.logtest6.1c1f097c-0a56-4e80-b2ff-bcb166a080d1.2020.08.13 svsRcu6kTU2Osi_X8xqx9g 3 1 402000 0 548.5mb 274.2mb
green open project.logtest39.42a22cd5-3bb7-40fa-a39e-c63ea81e7f8b.2020.08.13 1EeMCqihRaOwhS9QSslX3Q 3 1 153250 0 211.5mb 105.4mb
green open project.logtest80.4dcf3c47-67c3-4af8-a20e-00ce32377cc7.2020.08.13 Kl2Tkha1Qg-ETwdQ-0gKdQ 3 1 202750 0 277.5mb 138.4mb
green open project.logtest90.39846254-42da-4b5d-be34-82f14c5975e3.2020.08.13 4DZ5a3C9R8aE5_EXeBYzBg 3 1 153250 0 210.2mb 104.6mb
green open project.logtest13.af6f76f8-0eca-4475-9626-b2dcac6b4b73.2020.08.13 BRbBWcLdQqKZ_SO_VCokTQ 3 1 203000 0 278.8mb 139.2mb
green open project.logtest14.06a31b33-301b-47cb-a78e-b742af36e96f.2020.08.13 6gNjA0iNTJGDVFYfkKhM1Q 3 1 204000 0 279.7mb 140.1mb
green open project.logtest29.c862fa9f-9537-4d25-b49a-76211a8f99b2.2020.08.13 OuMXpr5dRQGy3XH3WHa32w 3 1 204000 0 280.3mb 140.5mb
green open project.logtest3.b5a5874d-fb17-4b05-b0b0-a9cd68260b83.2020.08.13 C6NOizcZSYuWEzYRCqdarw 3 1 15500 0 23.1mb 11.4mb
green open project.logtest16.b58bcfe5-93e4-4a8b-87ab-ae4bd3122946.2020.08.13 syyaAOjzSfadd7_hsoD4Mw 3 1 302750 0 414.5mb 207.1mb
green open project.logtest36.d81afb6c-4002-4789-b602-bf636d7debc6.2020.08.13 3TvDkL9pQBqxDLd7Oa9sXQ 3 1 451750 0 618.7mb 309.1mb
green open project.logtest96.d47b7e6f-95b7-4acc-82bd-21121c4e1d51.2020.08.13 1GAD7hAJQ7aArMo2Ei8TrA 3 1 200500 0 276.1mb 137.9mb
green open project.logtest27.832873fc-807c-497a-a90c-ad43668c1058.2020.08.13 _aatrWOpRryuv1yc3_CN-g 3 1 253500 0 346.9mb 173.3mb
green open project.logtest48.ed54b2e3-ec08-48f6-948a-4e559aa27e76.2020.08.13 MdEZZWRMRdyC-6ihLV7QuA 3 1 202750 0 277.9mb 139mb
green open project.logtest50.50f673b5-33dc-464b-a11b-3d0f54f39d03.2020.08.13 5q7ibP-BTPG8ZVvjx5sBiQ 3 1 103750 0 143.4mb 71.5mb
green open project.logtest7.403606b2-be6b-4376-af12-b9ae8b857ee2.2020.08.13 M1eMjVDkQ5SaxOObxW7Uqg 3 1 303000 0 415.3mb 207.7mb
green open project.logtest8.3dbf737e-2aa8-4f63-ae6e-2a1bdd777713.2020.08.13 R4nm9EaOTViH8t98APETHA 3 1 251750 0 345.6mb 172.8mb
green open project.logtest99.9c8736df-c850-41b5-a107-d7af31b0c478.2020.08.13 WIGEqQyKTIaVBgUrOT4dSQ 3 1 203000 0 279.2mb 139.5mb
green open project.logtest0.bf7f1765-e5b7-424e-b741-0fd2f73c3b1d.2020.08.13 _NTYjQ11SN-Qyspr0DwqJA 3 1 200500 0 274.7mb 137.9mb
green open project.logtest31.327f9312-48f5-4687-8355-030cc116b5bb.2020.08.13 zCK7RdQwTHi7gSxwbhGR0A 3 1 303250 0 415.1mb 208.1mb
green open project.logtest17.a535b1b2-c638-43aa-9aee-9e9515dec5b6.2020.08.13 cSPYYwxeTf6wecdto8MBQw 3 1 550500 0 750.3mb 375.2mb
green open project.logtest85.3bc6a3fd-4e32-4a05-8211-ec175ea5ec64.2020.08.13 _sCuf0-9Rgmxz3S_N8bDKQ 3 1 103567 0 143.7mb 71.8mb
green open project.logtest18.3e85e952-e453-4b52-8923-b9416fdfd93c.2020.08.13 zOUeK3LSSkupxfQ_nRCMKg 3 1 352500 0 482.1mb 241.1mb
green open project.logtest3.e802146c-6399-467e-8d1e-fcdaccbcb6ed.2020.08.13 bNpeZbMoTD6zqj6Ybf0Log 3 1 151750 0 208.8mb 104.1mb
green open project.logtest26.a589cb51-4f5d-4ee8-9d7a-d01f90844eb5.2020.08.13 1kAgoux2QDyav81JWgTLzA 3 1 303000 0 415.5mb 207.8mb
green open project.logtest33.9d74063d-b622-49f7-ab7b-f8946660eaa2.2020.08.13 RIBzZ8NhSEmFp2AXsMFDSQ 3 1 253500 0 346.2mb 173mb
green open project.logtest74.095753e8-81bd-4679-9096-ac66b121825b.2020.08.13 r5P19K2dRkKj4j_NMA-G_A 3 1 100500 0 140.1mb 69.9mb
green open project.logtest23.bde0e437-ba92-43cf-994d-ea33dc8de76b.2020.08.13 YxG7ukVBRROVMxu2rUV29Q 3 1 352500 0 482.6mb 241.1mb
green open project.logtest64.96351148-6dc3-4996-b621-a77dd1cd2f00.2020.08.13 lDh0h2_wRk-xwIGpAJSolg 3 1 100000 0 139.7mb 69.5mb
green open project.logtest51.4172ecc8-b056-4677-bc9e-62f1f0c1b876.2020.08.13 m3uK8QN0R22ZV9wh-o1tsQ 3 1 203000 0 279.3mb 139.3mb
green open project.logtest35.a70d6498-bccf-4c0c-9b51-6fe96fe7d4e9.2020.08.13 mqXZO-UrSs-gsHylijnOaQ 3 1 201750 0 276.4mb 138.2mb
green open project.logtest53.f1074e11-6822-4851-8309-c9844245126a.2020.08.13 28R0VM6dQLC8EFoFwhn23A 3 1 202500 0 277.8mb 138.6mb
green open project.logtest94.5366d218-f1b7-4bd1-9c68-35b7058b3295.2020.08.13 5Km_OoT_T9avlomaNu33uA 3 1 152000 0 208.8mb 104.6mb
green open project.logtest69.d8aa5b46-57ce-4e8f-9921-7d26b9fb2595.2020.08.13 b9whDhvvTK2CEkTLa2YuAg 3 1 302750 0 414.3mb 206.8mb
green open project.logtest10.a31bb02a-3834-404d-93db-4d028ff90c3a.2020.08.13 1eUSUANATRWR20MmKg9Oow 3 1 351750 0 482.7mb 241.2mb
green open project.logtest92.e75fbec3-b659-4ed3-b0f8-06b34db01c10.2020.08.13 IFyGQVoaR6KWQ6NPQFdjug 3 1 252500 0 344.7mb 172.6mb
green open project.logtest93.a5781f58-63bf-47e8-8150-a5f898aecb3d.2020.08.13 0zqD198BSe-g1GfVthW82Q 3 1 154000 0 211.2mb 105.6mb
green open project.logtest76.bca080d1-7a0b-49fe-b706-01a54ddbaf54.2020.08.13 KcrgXBXdS7etK5P1VTAdfQ 3 1 252000 0 345.5mb 172.8mb
green open project.logtest68.f9aeca84-3084-4d5f-986f-29f47d576c0b.2020.08.13 9NkFUMsPQYSxx3K7NUtk5w 3 1 202750 0 277.7mb 138.6mb
green open project.logtest88.829a9619-62c9-47ee-a734-99d2ed35996f.2020.08.13 KuAN_CVrTMqJdBE0sky22g 3 1 200257 0 274.4mb 137.3mb
green open project.logtest0.f16f49bf-17f3-46cc-819d-27cc99930bd9.2020.08.13 xdfxlUYLSkC1VAUCRGHnsA 3 1 16000 0 24.2mb 12mb
green open project.logtest56.98a9c838-64f6-4daa-922d-6f1fdad64ea4.2020.08.13 iH5z9zeFSQOA6I3qh1MsvQ 3 1 154000 0 212.4mb 106.2mb
green open project.logtest59.daa14fbd-3ec7-48eb-9a95-fab00e21fd2a.2020.08.13 K-LmMZkFRy2F_5FeqmEIQw 3 1 104000 0 143.7mb 72.1mb
green open project.logtest65.054d59bf-3b61-4f48-aff8-9c43bff7a1d9.2020.08.13 AqbSKE4STOiIYLGtIIep6w 3 1 100500 0 139.5mb 69.5mb

Only 2 out of the 100 indices actually got the expected 600000 messages; the others are missing a significant amount, even 15 minutes after the logging test stopped.

Looking closer at the last one, logtest65:

[kni@e16-h18-b03-fc640 ~]$ oc get pods -A -o wide | grep logtest65
logtest65 centos-logtest-8nq8h 1/1 Running 0 40m 10.130.24.7 worker035 <none> <none>

Checking the fluentd dir on node worker035:

[kni@e16-h18-b03-fc640 ~]$ oc debug node/worker035
Starting pod/worker035-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.222.48
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# ls -ltr /var/lib/fluentd/
clo_default_output_es/ retry_clo_default_output_es/
sh-4.4# ls -ltr /var/lib/fluentd/clo_default_output_es/
total 8
-rw-r--r--. 1 root root 221 Aug 13 07:05 buffer.b5acbcee6812adcfb761bf2c22534d4a2.log.meta
-rw-r--r--. 1 root root 1187 Aug 13 07:05 buffer.b5acbcee6812adcfb761bf2c22534d4a2.log

To me it looks like fluentd is not sending buffered messages to ES anymore. A must-gather can be found at http://file.str.redhat.com/mkarg/bz1833486/must-gather.tgz
Please let me know if you need any further information from the cluster.

Hi Jeff,
As the case is still unresolved, can you re-open the BZ?
Thanks,
Yaniv

Ignore comment #31, as the issue is now tracked in a new BZ (see https://bugzilla.redhat.com/show_bug.cgi?id=1877818).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196