Bug 1833486 - Logging performance degraded compared to 4.4
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jeff Cantrill
QA Contact: Eric Matysek
URL:
Whiteboard:
Duplicates: 1844639 1846174
Depends On:
Blocks: 1847027 1877818
 
Reported: 2020-05-08 17:58 UTC by Eric Matysek
Modified: 2020-09-10 14:15 UTC
CC: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1877818
Environment:
Last Closed: 2020-06-03 15:29:26 UTC
Target Upstream Version:


Attachments
Elasticsearch and Fluentd Logs (1.37 MB, application/gzip)
2020-05-11 18:22 UTC, Eric Matysek
Fluentd Logs (37.05 KB, text/plain)
2020-05-27 00:22 UTC, Eric Matysek


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 597 None closed Bug 1833486: Improve fluent performance in comparison to 4.4 2020-09-15 16:19:45 UTC
Github openshift cluster-logging-operator pull 600 None closed Bug 1833486: Collector config changes to address performance versus 4.4 2020-09-15 16:19:45 UTC
Github openshift cluster-logging-operator pull 609 None closed Bug 1833486: Revert rotate_wait and refresh_interval to 4.4 setting 2020-09-15 16:19:45 UTC
Github openshift origin-aggregated-logging pull 1904 None closed Bug 1845233: Mapping "template" field becomes "index_patterns" 2020-09-15 16:19:45 UTC
Github openshift origin-aggregated-logging pull 1916 None closed Bug 1838153: Fix index patterns in index templates 2020-09-15 16:19:45 UTC
Github openshift origin-aggregated-logging pull 1939 None closed Bug 1833486: Bump fluentd and dependencies to 1.11.1 2020-09-15 16:19:45 UTC
Github openshift origin-aggregated-logging pull 1953 None closed Bug 1833486: Revert fluentd to 1.7.4 to address performance regression 2020-09-15 16:19:44 UTC

Description Eric Matysek 2020-05-08 17:58:57 UTC
Description of problem:
In v4.4 we can log 2.5k msgs/sec from a single node without issue.
Today on v4.5 I am unable to log at even 1k msgs/sec from a single node and have all my messages appear in elasticsearch.


Version-Release number of selected component (if applicable):
v4.5.0


How reproducible:
100%


Steps to Reproduce:
1. Deploy logging stack from https://github.com/openshift/origin-aggregated-logging.git on release-4.5 branch
2. Run cluster-loader logging workload
3. Check elasticsearch indices

Actual results:
# Test was 1.2M messages at 1k msg/s
$ oc exec elasticsearch-cdm-s89wygkk-1-6ffcfd6f58-5c72x -- indices | grep app
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-s89wygkk-1-6ffcfd6f58-5c72x -n openshift-logging' to see all of the containers in this pod.
green  open   app-000001   bRO5HELwQnmxjwU67WpD5Q   3   1     766500            0        829            414
green  open   app-000003   qf7pqZR0T2yX0oN2HFQ60g   3   1          0            0          0              0
green  open   app-000002   Pa4iwLpESFmOunFKifSSnA   3   1     160000            0        173             86


$ curl 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
      "term": {"kubernetes.namespace_name": "logtest-45-0"}
    }
}
'
{
  "count" : 926500,
  "_shards" : {
    "total" : 9,
    "successful" : 9,
    "skipped" : 0,
    "failed" : 0
  }
}


$ python verify_logtest_index.py --stream -i 'app*' -m 1200000
Index document count: 926500
Missing log line(s): 926501-1200000 (273500)
No duplicates found!
Number of missing logs: 273500
22.7917% message loss rate
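The verification script itself is not attached; as a rough sketch of the kind of check it performs (assuming, as the workload does, that each generated log line carries a consecutive sequence number — `report_missing` here is a hypothetical helper, not part of the actual script):

```python
def report_missing(indexed_seqs, expected_total):
    """Given the sequence numbers found in the app* indices, report
    contiguous ranges of missing log lines and the overall loss rate."""
    found = set(indexed_seqs)
    missing_ranges = []
    start = None
    for seq in range(1, expected_total + 1):
        if seq not in found:
            if start is None:
                start = seq          # open a new missing range
        elif start is not None:
            missing_ranges.append((start, seq - 1))
            start = None
    if start is not None:            # range runs to the end of the test
        missing_ranges.append((start, expected_total))
    n_missing = sum(hi - lo + 1 for lo, hi in missing_ranges)
    loss_rate = 100.0 * n_missing / expected_total
    return missing_ranges, n_missing, loss_rate

# Mirroring the run above: 1.2M expected, only the first 926500 indexed.
ranges, n, rate = report_missing(range(1, 926501), 1_200_000)
# ranges == [(926501, 1200000)], n == 273500, rate ~ 22.79%
```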

Expected results:
All 1.2M messages in the app* indices


Additional info:

Comment 1 Jeff Cantrill 2020-05-11 17:04:25 UTC
Please post elasticsearch and fluent logs for review

Comment 2 Eric Matysek 2020-05-11 18:22:07 UTC
Created attachment 1687396 [details]
Elasticsearch and Fluentd Logs

Attached logs as requested.
I generated 600k logs for this cluster at 500/s and only got 370k successfully indexed.

$ python verify_logtest_index.py -i 'app*' --stream -m 600000
Index document count: 370000
Missing log line(s): 1-230000 (230000)
No duplicates found!
Number of missing logs: 230000
38.3333% message loss rate

Comment 3 Jeff Cantrill 2020-05-12 00:24:43 UTC
Maybe caused by https://bugzilla.redhat.com/show_bug.cgi?id=1834558

Comment 4 Eric Matysek 2020-05-15 20:41:15 UTC
The fix for bug 1834558 did significantly improve performance, however I am still seeing a decrease in throughput compared to 4.4 for single pod logging rates.

On 4.4 we were able to reliably log 2.5k msg/s, and even saw some passes at 3k msg/s.
On 4.5, 2k msg/s seems reliable, but I have not been able to get a single passing run at 2.5k msg/s.

Comment 5 Jeff Cantrill 2020-05-21 20:20:01 UTC
I'm having difficulties standing up a cluster to verify, but please try the following to bypass the proxy to elasticsearch so we may determine how it is affecting performance.

1. oc edit clusterlogging instance (set ManagementState: Unmanaged)
2. oc edit elasticsearch elasticsearch (set ManagementState: Unmanaged)
3. oc edit configmap elasticsearch (modify the network settings to match: https://github.com/openshift/elasticsearch-operator/blob/release-4.2/pkg/k8shandler/configuration_tmpl.go#L17-L18)
4. delete elasticsearch pods to force them to restart in order to load the config
5. oc edit service elasticsearch (Modify "targetPort" to be 9200 instead of 'restapi')


Fluentd should already be able to write to ES using its certificates; these changes modify the service to direct traffic straight to the ES container, bypassing the proxy.

Comment 6 Periklis Tsirakidis 2020-05-22 07:10:49 UTC
@ematysek

Have you tried the above comments from https://bugzilla.redhat.com/show_bug.cgi?id=1833486#c5 ?

Comment 7 Eric Matysek 2020-05-27 00:19:49 UTC
Unfortunately I wasn't able to get logs successfully indexed into elasticsearch after making the changes mentioned by Jeff.

Comment 8 Eric Matysek 2020-05-27 00:22:43 UTC
Created attachment 1692476 [details]
Fluentd Logs

Fluentd logs after making changes trying to bypass proxy.
Mostly filled with lines like this:
2020-05-26 19:13:45 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=2 next_retry_seconds=2020-05-26 19:13:47 +0000 chunk="5a690759010aca78709f913e131436a9" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
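For context, the "buffer is full" warning is governed by the fluentd `<buffer>` section of the output configuration; a minimal sketch of the relevant knobs (the values and overflow policy shown are illustrative, not the shipped 4.5 configuration):

```
<match **>
  @type elasticsearch
  <buffer>
    @type file
    path /var/lib/fluentd/clo_default_output_es
    # Once total buffered data reaches this limit, the warning above is
    # emitted and new events can no longer be enqueued.
    total_limit_size 256m
    # What to do when full: throw_exception, block, or drop_oldest_chunk.
    overflow_action block
    flush_thread_count 2
    retry_max_interval 300
  </buffer>
</match>
```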

Comment 9 Jeff Cantrill 2020-05-27 15:46:34 UTC
Lowering severity to low, as this is not a functional blocker for 4.5.

Comment 10 Jeff Cantrill 2020-05-27 18:35:22 UTC
(In reply to Eric Matysek from comment #8)
> Created attachment 1692476 [details]
> Fluentd Logs
> 
> Fluentd logs after making changes trying to bypass proxy.
> Mostly filled with lines like this:
> 2020-05-26 19:13:45 +0000 [warn]: [clo_default_output_es] failed to flush
> the buffer. retry_time=2 next_retry_seconds=2020-05-26 19:13:47 +0000
> chunk="5a690759010aca78709f913e131436a9"
> error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure
> error="buffer is full."

Not certain this is related to the perf issue, but it definitely seems like it would block forever: https://github.com/uken/fluent-plugin-elasticsearch/pull/688

Comment 11 Jeff Cantrill 2020-05-27 18:41:32 UTC
(In reply to Jeff Cantrill from comment #10)
> (In reply to Eric Matysek from comment #8)
> > Created attachment 1692476 [details]
> > Fluentd Logs
> > 
> > Fluentd logs after making changes trying to bypass proxy.
> > Mostly filled with lines like this:
> > 2020-05-26 19:13:45 +0000 [warn]: [clo_default_output_es] failed to flush
> > the buffer. retry_time=2 next_retry_seconds=2020-05-26 19:13:47 +0000
> > chunk="5a690759010aca78709f913e131436a9"
> > error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure
> > error="buffer is full."
> 
> Not certain this is related to the perf issue but definitely seems like it
> would block forever
> https://github.com/uken/fluent-plugin-elasticsearch/pull/688

We have 4.0.5, and this was merged earlier, so it is likely not the blocker.

Comment 12 Lukas Vlcek 2020-05-28 12:18:24 UTC
I would be interested in learning how many primary shards and replicas we have for each index in ES in this case.

Can you please get indices status? You can use the "indices" utility.
https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/utils/indices

Given that we should have significantly fewer indices now, we need to check whether we can get better performance by using more shards.

Comment 13 Lukas Vlcek 2020-05-28 12:38:33 UTC
I found a bug. We did not update index name patterns in common.settings.* files in https://github.com/openshift/origin-aggregated-logging/tree/master/elasticsearch/index_templates
All of them still assume the old index naming conventions ("project*", ".operations*", etc.). These files are not part of https://github.com/ViaQ/elasticsearch-templates, so they need to be updated separately.

That should mean that those indices fall back to the default Elasticsearch sharding (https://www.elastic.co/guide/en/elasticsearch/reference/6.8/indices-create-index.html#create-index-settings), which is 5 shards and 1 replica. The indices status report should confirm this.
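For reference, an Elasticsearch 6.x index template targeting the new index names would look roughly like this (an illustrative sketch of the 6.x template format, not the contents of the actual common.settings.* files; the pattern list is an assumption):

```json
{
  "index_patterns": ["app*"],
  "order": 0,
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}
```

Note that 6.x renamed the old `"template"` field to `"index_patterns"`, which is what PRs 1904/1916 in the links above address; templates still using the old field or old name patterns simply never match, so the default sharding applies.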

Comment 14 Jeff Cantrill 2020-06-02 21:16:08 UTC
Updating target to 4.6 to get these fixes into the release and backported to 4.5; backport requested in GitHub.

Comment 15 Eric Matysek 2020-06-02 21:26:07 UTC
(In reply to Lukas Vlcek from comment #13)

> That should mean that indices should use the default Elasticsearch sharding
> (https://www.elastic.co/guide/en/elasticsearch/reference/6.8/indices-create-
> index.html#create-index-settings) which is 5 shards and 1 replicas. Indices
> status report should confirm this.

You can see in my original comment that the app* indices had 3 primary shards and 1 replica.
I think if the bottleneck were Elasticsearch itself, I would be seeing some sort of anomaly in Elasticsearch pod CPU/memory usage.

Comment 17 Mike Fiedler 2020-06-03 17:07:58 UTC
This is a regression that results in the loss of pod log messages; tagging as a regression. I think this needs to be targeted for 4.5?

Comment 18 Eric Matysek 2020-06-24 19:41:34 UTC
Here is an example of random chunks of messages being lost, to differentiate this bug from bug 1846174.
Test setup:
1 project
2.4M lines
2k/s rate

$ python verify_logtest_index.py --stream -i app* -m 2400000
Index document count: 330000
Missing log line(s): 273001-1799000 (1526000)
Missing log line(s): 1856001-2400000 (544000)
No duplicates found!
Number of missing logs: 2070000
86.2500% message loss rate

Comment 19 Periklis Tsirakidis 2020-07-06 07:25:55 UTC
Adding UpcomingSprint; per the ongoing investigation [1], the PR does not seem conclusive yet.

[1] https://coreos.slack.com/archives/CB3HXM2QK/p1593547317176100

Comment 20 Mike Fiedler 2020-07-07 20:11:41 UTC
*** Bug 1846174 has been marked as a duplicate of this bug. ***

Comment 21 Jeff Cantrill 2020-07-08 19:22:39 UTC
*** Bug 1844639 has been marked as a duplicate of this bug. ***

Comment 22 Jeff Cantrill 2020-07-08 19:53:10 UTC
Rebuilt the 4.4 image to be 4.5-compatible (e.g. hostname, updated viaq plugin, wait_for_es_script) and reliably ran "WORKLOAD_DIR=$(pwd) NUM_LINES=1125000 RATE=150000 ansible-playbook -v -i inventory workloads/logging.yml" three times in succession without failure.

Switched to the 4.5 image and ran the same test, which passed the first time and has failed 2 subsequent times.

Comment 27 Eric Matysek 2020-07-27 19:33:10 UTC
Unfortunately this bug still exists for me.
I've also tried deploying CLO/EO 4.4 on a 4.5 OCP cluster and am able to log at the expected rates, so I think that observation confirms the regression was introduced in CLO/EO rather than somewhere else in OCP.

Comment 28 Eric Matysek 2020-07-29 19:40:27 UTC
Confirmed that PR 1953 fixes this issue; we are able to log at 2.5k msg/s consistently.

Comment 29 Marko Karg 2020-08-13 06:09:03 UTC
I can confirm that logging at 2.5k msg/s works fine on 4.5.5, but going for higher rates leaves me with messages stuck in the fluentd pods and never delivered to ES.

Comment 30 Marko Karg 2020-08-13 08:11:47 UTC
I reran a test with 100 pods in 100 namespaces, logging at 50000 msgs/sec in total for 20 minutes, which should result in 600000 messages in every index.
This is what I got:

green  open   project.logtest98.fdfae9ef-a84a-4c7e-8f70-e580ddf60159.2020.08.13        f3roeu5rR1utKqlASrfTlQ   3   1     451750            0    617.1mb        308.6mb
green  open   project.logtest83.bfbfa57a-2794-4450-90f2-7f9d283c5365.2020.08.13        Lozpgwd9SLmfImn4OQW9fg   3   1     103500            0    143.3mb         72.2mb
green  open   project.logtest86.dcc4f112-0328-49e4-b177-97fb2527b5d8.2020.08.13        POLDYc3HTXyBB30PFZ4H8g   3   1     203000            0    277.4mb          138mb
green  open   project.logtest61.cc3784c7-7702-4b64-84bc-bb5b52235677.2020.08.13        2-mxFKBpTdCoa1C_exjQSw   3   1     100250            0    139.4mb         69.8mb
green  open   project.logtest82.abc253a2-965d-4493-bc30-cfda648d0124.2020.08.13        0MXvzY-qTw6A201EdWoaCA   3   1     201500            0    255.4mb        127.6mb
green  open   project.logtest57.9111cd70-40fa-464d-8a9c-1303ee60e31b.2020.08.13        n7wWZum-S6-6EdpAjoPfzw   3   1     151750            0    208.1mb        103.8mb
green  open   project.logtest40.e924f170-4805-4a71-b10a-c17167d56a58.2020.08.13        n7f_RiqPQHO2O_WyALjLxQ   3   1     200750            0    273.7mb          137mb
green  open   project.logtest37.226c2a8f-e2bd-4154-bc07-7d07e0835389.2020.08.13        QRO2_AEzT_WXwZYfh-ElXw   3   1     252750            0      346mb          173mb
green  open   project.logtest5.c9521dac-510d-40a8-b7b6-94d74851d28a.2020.08.13         8jsCUniUSpu8s4JbXYGdVg   3   1     201500            0    277.4mb        138.5mb
green  open   project.logtest34.1c1679fb-ba08-460d-bc81-1103ffbac4d3.2020.08.13        Hqu7M6C9Q5GUu4EpsYs2_g   3   1     251500            0    344.3mb        171.4mb
green  open   project.logtest97.1f516add-6c98-4be1-9c87-3fd35aa1ec9b.2020.08.13        VmNbro9mTAWxBOz-N0PVog   3   1     151750            0    208.5mb        104.2mb
green  open   project.logtest28.0872992b-4ca1-43ba-84de-dea2b1284098.2020.08.13        w5543si2Rv-Z4yWE0IwKnw   3   1     152000            0    209.5mb        104.8mb
green  open   project.logtest71.f293aeec-656f-421d-be27-a806e68b8dea.2020.08.13        PnSpUG2yQWaBNflO1Dx-jA   3   1     352000            0    481.1mb        240.6mb
green  open   project.logtest70.2e02f50f-a5ea-4e77-89ad-d033370e2066.2020.08.13        k8DIiOaxQ7WDL9LaVZzGKw   3   1     100000            0    138.2mb         68.5mb
green  open   project.logtest8.017152b5-d477-4147-8518-f02358fcec21.2020.08.13         AzHHg0LPTAG_nWlYUhRMFg   3   1      16000            0     24.4mb         12.1mb
green  open   project.logtest75.0e3a3449-e98b-4994-959d-500553bff5bc.2020.08.13        LV-IwSTxSl-9vpGOAnWEqA   3   1     100000            0    137.5mb         68.5mb
green  open   project.logtest20.5da98cdb-7f58-40ad-a92d-59d6cb20a4d0.2020.08.13        qjOK8yd0RtinDDi9zGz44w   3   1     103500            0    142.7mb         71.6mb
green  open   project.logtest30.0099fef1-28b3-4902-9e1c-9a23aefd78bc.2020.08.13        0vYDJT90Q7eTg5wFUItXeg   3   1     204000            0    280.2mb        140.5mb
green  open   project.logtest84.e6e79c7f-142a-42c3-a353-f68a6372634b.2020.08.13        81ft2GteTh2i9WphdqlzJQ   3   1     202750            0    277.8mb        138.8mb
green  open   project.logtest63.2a8f44bb-2307-4399-8479-6796c77a8908.2020.08.13        1j9mFcZYRlqsQogrN5N0Ag   3   1     202500            0    278.3mb        138.6mb
green  open   project.logtest87.4019da71-7b97-47dd-b473-5baf49d90f21.2020.08.13        FfCJF2IvSU2BHW1ohTIPLg   3   1     200000            0    273.9mb        137.1mb
green  open   project.logtest22.14853e9d-dd11-4aef-ba4e-3c19395a0d95.2020.08.13        T8aBvfFfStOdkEvczUItpQ   3   1     203000            0    278.7mb        139.2mb
green  open   project.logtest58.230cf278-d81e-4957-981e-0f994dc24c22.2020.08.13        eAKnTToATtOiiF0mTXmRtg   3   1     103750            0    142.2mb         70.9mb
green  open   project.logtest6.12b8e87a-de40-4e5b-a85e-8acbe52519c9.2020.08.13         Gr2GwGXyRpOfxH_VPzvgIQ   3   1      16000            0     24.3mb         12.1mb
green  open   project.logtest91.4b837eda-5c21-4b38-8351-d797afcf2182.2020.08.13        L_K32ayrTJyOl89c8oKKMg   3   1     252500            0    346.5mb        172.9mb
green  open   project.logtest42.38ce51a4-88e2-4e88-a820-62b5b4275c1c.2020.08.13        7KaHvVMhT7G66jkVdPvHPA   3   1     203000            0      277mb        138.5mb
green  open   project.logtest67.95bf0ab6-ac87-4a38-8ffa-e637163c157d.2020.08.13        yFAgph3xR2CnwuKMgv91UQ   3   1     100750            0    139.3mb         69.7mb
green  open   project.logtest12.2eb97788-8aa8-4dfa-a5f7-6b4b9f463c8b.2020.08.13        RnEbfFOkRvWWLnD2cHEYSA   3   1     501000            0    684.3mb        342.2mb
green  open   project.logtest79.d4d778fe-78df-4d2a-a941-1c6f641c2875.2020.08.13        NiDbcNIdQAm_cWEyM2ldsQ   3   1     149750            0    207.6mb        103.9mb
green  open   project.logtest47.600e17ea-c88f-447e-a9f1-1a610c1fa5d4.2020.08.13        8dATJn1AT0e3WMVIWYkQQg   3   1     204000            0    279.8mb        139.8mb
green  open   project.logtest41.572529c5-6cdb-4389-bd2f-7c78c98b3ba9.2020.08.13        lMoEmd_KQaC6_mu0L03fLw   3   1     352750            0    483.4mb        241.6mb
green  open   project.logtest2.f60f1a2c-8c92-4eda-961c-284eff3c862c.2020.08.13         gdcmORREQD6LGcivPOqGIw   3   1     302000            0      413mb        206.2mb
green  open   project.logtest49.bd65fda1-fcda-4811-9921-dbf857c53687.2020.08.13        YhpsDLrSS9KKwQcj4AV9dw   3   1     202805            0    278.4mb        138.9mb
green  open   project.logtest32.bfec867f-d805-48cd-938f-a3b219027850.2020.08.13        P0ja2ulNRQKBri8jPuM04Q   3   1     101000            0    140.2mb         69.9mb
green  open   project.logtest38.fb0a2902-89b4-47f5-a447-e30323ddc4db.2020.08.13        jqwzyEBhQ_S3kpXJechkuA   3   1     252750            0    344.9mb        172.7mb
green  open   project.logtest62.b4cc0582-835e-4a6a-b355-e603590789fd.2020.08.13        mYjNpdrxRxeKdRlTOrE6vw   3   1     202500            0    276.7mb        138.2mb
green  open   project.logtest9.bd3ab32d-a801-4ecf-bb86-080bf26729f8.2020.08.13         YZ9dUEVVTR-TWO4V2fERrg   3   1     252250            0    345.5mb        172.5mb
green  open   project.logtest21.ef9394cb-6ed8-48f1-a0e0-8366c3705fee.2020.08.13        jscgX1nERCKkhxr0Kofjew   3   1     351750            0      482mb        240.9mb
green  open   project.logtest95.3bf7b5ca-7cf6-45a1-b529-476158cc20ea.2020.08.13        cGGkP_U9SIuf1JyeH2jiZA   3   1     252250            0    345.1mb        172.7mb
green  open   project.logtest1.721c4164-e3ec-42ca-a1ee-738abec432d6.2020.08.13         G-11jZRVQkqs270rRy7OmQ   3   1     600000            0      819mb        409.6mb
green  open   project.logtest4.3ee3b7b0-724c-4c23-bbcd-c3fc1bfe266a.2020.08.13         KGEtDaNCRQa9qCeawA0uDw   3   1     501000            0    685.5mb        342.8mb
green  open   project.logtest24.3e88b5a3-3754-49ae-9aa5-1c9f643c255c.2020.08.13        o_CWjoAITdOPis1m_vvQ9w   3   1     600000            0    817.5mb        408.6mb
green  open   project.logtest81.e19926d0-59bb-4850-b469-b775b9dd9661.2020.08.13        VR5A1qAIRoOPZx0eMdGOaw   3   1     256723            0    326.3mb        163.5mb
green  open   project.logtest60.f6a9c1ac-0302-496b-9d14-4de7ef78eeb9.2020.08.13        T-Mfo8uwTD6ctrhcxmo3PA   3   1     202630            0    278.3mb        138.5mb
green  open   project.logtest77.536c1a73-7c38-4825-94a1-f4ed32c65641.2020.08.13        QRRjY5YNTuSltgVM-lMTPg   3   1     102000            0    141.3mb         70.7mb
green  open   project.logtest11.c6482750-a9e8-4dbc-b3f4-1402ce1d401e.2020.08.13        4AiBAfsFTPalXf623vyTIQ   3   1     253750            0    347.5mb        173.7mb
green  open   project.logtest43.3cd984b9-c2da-4bbe-b513-7f821e818ec9.2020.08.13        6WUk-8IbQoCpgNYH0lRfJg   3   1     153750            0    211.4mb        105.8mb
green  open   project.logtest72.ba3aa48b-8358-4a4c-a512-717f236142cb.2020.08.13        wIwwNQ77Qd62RjVQWCWSMA   3   1     251750            0    318.3mb          159mb
green  open   project.logtest52.c88f7265-fa2d-4813-a36a-f8a2054a55f6.2020.08.13        Wfdnm6TyQG2_xjhjm0HqKA   3   1     203000            0    278.7mb        139.6mb
green  open   project.logtest78.bb3de4aa-9f42-4f58-911a-383aa3a47b39.2020.08.13        NvHzfFJ7TTGeq3-xKTu07Q   3   1     152000            0    194.3mb         97.2mb
green  open   project.logtest15.dda34629-0e1d-46ef-8c56-f7d7c4b9c781.2020.08.13        doiW1BI2R3moeky_I1Hw7g   3   1     252750            0    348.1mb        174.1mb
green  open   project.logtest46.38645aac-c60a-4175-95be-b0936e44b63f.2020.08.13        WGhDKeaCSIOLAMKcB8AoYw   3   1     402000            0    548.6mb        274.6mb
green  open   project.logtest55.fbfc4a80-f54f-4946-9897-c0089dda8520.2020.08.13        Gk7nV1vuRSeQPbFunOib_g   3   1     402000            0    550.3mb        275.2mb
green  open   project.logtest54.532a44d6-2841-45f3-91fc-d3ee96033d57.2020.08.13        d5ESPmkeTrufgS222NxY5Q   3   1     151580            0    208.9mb        104.5mb
green  open   project.logtest45.ee24e156-b682-482a-a99f-d5aa2b6447df.2020.08.13        NJE7qfr9QDKZ2-z17CnGuQ   3   1     100500            0    139.4mb         69.5mb
green  open   project.logtest89.347a235a-5682-4bca-8a29-b0de6b18c59d.2020.08.13        gSP26jjpTre5uMWbOT1QhQ   3   1      51000            0     71.2mb         35.7mb
green  open   project.logtest19.bc3dbbd3-e996-4b28-8a38-dbf2a6959929.2020.08.13        i7CpFQ93StGJV_ZNUXfc7Q   3   1     203250            0    279.6mb        139.6mb
green  open   project.logtest25.a47ec1ff-d1ab-4d85-af73-79e6876df681.2020.08.13        bR2ZqQMWQQqrVeSPwVk8hA   3   1     252750            0    345.8mb        173.2mb
green  open   project.logtest73.b0fa5917-6c01-48e6-997b-4b8fec79258d.2020.08.13        9MaOc2B2SFG5akTSzmLXJg   3   1     203000            0    278.2mb        139.3mb
green  open   project.logtest66.5ff3ca9b-f769-41a5-abbf-f649b4053d24.2020.08.13        AxDKp6PuRwaBG3JrC4RnhA   3   1     102000            0    141.1mb         71.1mb
green  open   project.logtest4.4a8cf193-b552-467f-bd39-42e15dea26a0.2020.08.13         h1VuZy2jSDWsiDxgeBTGBA   3   1      15500            0     23.9mb         11.9mb
green  open   project.logtest44.37095fb0-5272-4baf-8a43-94b5f01c5cbc.2020.08.13        opqhAbfDR3OtduE7fsq2FQ   3   1     402000            0    550.3mb        275.1mb
green  open   project.logtest6.1c1f097c-0a56-4e80-b2ff-bcb166a080d1.2020.08.13         svsRcu6kTU2Osi_X8xqx9g   3   1     402000            0    548.5mb        274.2mb
green  open   project.logtest39.42a22cd5-3bb7-40fa-a39e-c63ea81e7f8b.2020.08.13        1EeMCqihRaOwhS9QSslX3Q   3   1     153250            0    211.5mb        105.4mb
green  open   project.logtest80.4dcf3c47-67c3-4af8-a20e-00ce32377cc7.2020.08.13        Kl2Tkha1Qg-ETwdQ-0gKdQ   3   1     202750            0    277.5mb        138.4mb
green  open   project.logtest90.39846254-42da-4b5d-be34-82f14c5975e3.2020.08.13        4DZ5a3C9R8aE5_EXeBYzBg   3   1     153250            0    210.2mb        104.6mb
green  open   project.logtest13.af6f76f8-0eca-4475-9626-b2dcac6b4b73.2020.08.13        BRbBWcLdQqKZ_SO_VCokTQ   3   1     203000            0    278.8mb        139.2mb
green  open   project.logtest14.06a31b33-301b-47cb-a78e-b742af36e96f.2020.08.13        6gNjA0iNTJGDVFYfkKhM1Q   3   1     204000            0    279.7mb        140.1mb
green  open   project.logtest29.c862fa9f-9537-4d25-b49a-76211a8f99b2.2020.08.13        OuMXpr5dRQGy3XH3WHa32w   3   1     204000            0    280.3mb        140.5mb
green  open   project.logtest3.b5a5874d-fb17-4b05-b0b0-a9cd68260b83.2020.08.13         C6NOizcZSYuWEzYRCqdarw   3   1      15500            0     23.1mb         11.4mb
green  open   project.logtest16.b58bcfe5-93e4-4a8b-87ab-ae4bd3122946.2020.08.13        syyaAOjzSfadd7_hsoD4Mw   3   1     302750            0    414.5mb        207.1mb
green  open   project.logtest36.d81afb6c-4002-4789-b602-bf636d7debc6.2020.08.13        3TvDkL9pQBqxDLd7Oa9sXQ   3   1     451750            0    618.7mb        309.1mb
green  open   project.logtest96.d47b7e6f-95b7-4acc-82bd-21121c4e1d51.2020.08.13        1GAD7hAJQ7aArMo2Ei8TrA   3   1     200500            0    276.1mb        137.9mb
green  open   project.logtest27.832873fc-807c-497a-a90c-ad43668c1058.2020.08.13        _aatrWOpRryuv1yc3_CN-g   3   1     253500            0    346.9mb        173.3mb
green  open   project.logtest48.ed54b2e3-ec08-48f6-948a-4e559aa27e76.2020.08.13        MdEZZWRMRdyC-6ihLV7QuA   3   1     202750            0    277.9mb          139mb
green  open   project.logtest50.50f673b5-33dc-464b-a11b-3d0f54f39d03.2020.08.13        5q7ibP-BTPG8ZVvjx5sBiQ   3   1     103750            0    143.4mb         71.5mb
green  open   project.logtest7.403606b2-be6b-4376-af12-b9ae8b857ee2.2020.08.13         M1eMjVDkQ5SaxOObxW7Uqg   3   1     303000            0    415.3mb        207.7mb
green  open   project.logtest8.3dbf737e-2aa8-4f63-ae6e-2a1bdd777713.2020.08.13         R4nm9EaOTViH8t98APETHA   3   1     251750            0    345.6mb        172.8mb
green  open   project.logtest99.9c8736df-c850-41b5-a107-d7af31b0c478.2020.08.13        WIGEqQyKTIaVBgUrOT4dSQ   3   1     203000            0    279.2mb        139.5mb
green  open   project.logtest0.bf7f1765-e5b7-424e-b741-0fd2f73c3b1d.2020.08.13         _NTYjQ11SN-Qyspr0DwqJA   3   1     200500            0    274.7mb        137.9mb
green  open   project.logtest31.327f9312-48f5-4687-8355-030cc116b5bb.2020.08.13        zCK7RdQwTHi7gSxwbhGR0A   3   1     303250            0    415.1mb        208.1mb
green  open   project.logtest17.a535b1b2-c638-43aa-9aee-9e9515dec5b6.2020.08.13        cSPYYwxeTf6wecdto8MBQw   3   1     550500            0    750.3mb        375.2mb
green  open   project.logtest85.3bc6a3fd-4e32-4a05-8211-ec175ea5ec64.2020.08.13        _sCuf0-9Rgmxz3S_N8bDKQ   3   1     103567            0    143.7mb         71.8mb
green  open   project.logtest18.3e85e952-e453-4b52-8923-b9416fdfd93c.2020.08.13        zOUeK3LSSkupxfQ_nRCMKg   3   1     352500            0    482.1mb        241.1mb
green  open   project.logtest3.e802146c-6399-467e-8d1e-fcdaccbcb6ed.2020.08.13         bNpeZbMoTD6zqj6Ybf0Log   3   1     151750            0    208.8mb        104.1mb
green  open   project.logtest26.a589cb51-4f5d-4ee8-9d7a-d01f90844eb5.2020.08.13        1kAgoux2QDyav81JWgTLzA   3   1     303000            0    415.5mb        207.8mb
green  open   project.logtest33.9d74063d-b622-49f7-ab7b-f8946660eaa2.2020.08.13        RIBzZ8NhSEmFp2AXsMFDSQ   3   1     253500            0    346.2mb          173mb
green  open   project.logtest74.095753e8-81bd-4679-9096-ac66b121825b.2020.08.13        r5P19K2dRkKj4j_NMA-G_A   3   1     100500            0    140.1mb         69.9mb
green  open   project.logtest23.bde0e437-ba92-43cf-994d-ea33dc8de76b.2020.08.13        YxG7ukVBRROVMxu2rUV29Q   3   1     352500            0    482.6mb        241.1mb
green  open   project.logtest64.96351148-6dc3-4996-b621-a77dd1cd2f00.2020.08.13        lDh0h2_wRk-xwIGpAJSolg   3   1     100000            0    139.7mb         69.5mb
green  open   project.logtest51.4172ecc8-b056-4677-bc9e-62f1f0c1b876.2020.08.13        m3uK8QN0R22ZV9wh-o1tsQ   3   1     203000            0    279.3mb        139.3mb
green  open   project.logtest35.a70d6498-bccf-4c0c-9b51-6fe96fe7d4e9.2020.08.13        mqXZO-UrSs-gsHylijnOaQ   3   1     201750            0    276.4mb        138.2mb
green  open   project.logtest53.f1074e11-6822-4851-8309-c9844245126a.2020.08.13        28R0VM6dQLC8EFoFwhn23A   3   1     202500            0    277.8mb        138.6mb
green  open   project.logtest94.5366d218-f1b7-4bd1-9c68-35b7058b3295.2020.08.13        5Km_OoT_T9avlomaNu33uA   3   1     152000            0    208.8mb        104.6mb
green  open   project.logtest69.d8aa5b46-57ce-4e8f-9921-7d26b9fb2595.2020.08.13        b9whDhvvTK2CEkTLa2YuAg   3   1     302750            0    414.3mb        206.8mb
green  open   project.logtest10.a31bb02a-3834-404d-93db-4d028ff90c3a.2020.08.13        1eUSUANATRWR20MmKg9Oow   3   1     351750            0    482.7mb        241.2mb
green  open   project.logtest92.e75fbec3-b659-4ed3-b0f8-06b34db01c10.2020.08.13        IFyGQVoaR6KWQ6NPQFdjug   3   1     252500            0    344.7mb        172.6mb
green  open   project.logtest93.a5781f58-63bf-47e8-8150-a5f898aecb3d.2020.08.13        0zqD198BSe-g1GfVthW82Q   3   1     154000            0    211.2mb        105.6mb
green  open   project.logtest76.bca080d1-7a0b-49fe-b706-01a54ddbaf54.2020.08.13        KcrgXBXdS7etK5P1VTAdfQ   3   1     252000            0    345.5mb        172.8mb
green  open   project.logtest68.f9aeca84-3084-4d5f-986f-29f47d576c0b.2020.08.13        9NkFUMsPQYSxx3K7NUtk5w   3   1     202750            0    277.7mb        138.6mb
green  open   project.logtest88.829a9619-62c9-47ee-a734-99d2ed35996f.2020.08.13        KuAN_CVrTMqJdBE0sky22g   3   1     200257            0    274.4mb        137.3mb
green  open   project.logtest0.f16f49bf-17f3-46cc-819d-27cc99930bd9.2020.08.13         xdfxlUYLSkC1VAUCRGHnsA   3   1      16000            0     24.2mb           12mb
green  open   project.logtest56.98a9c838-64f6-4daa-922d-6f1fdad64ea4.2020.08.13        iH5z9zeFSQOA6I3qh1MsvQ   3   1     154000            0    212.4mb        106.2mb
green  open   project.logtest59.daa14fbd-3ec7-48eb-9a95-fab00e21fd2a.2020.08.13        K-LmMZkFRy2F_5FeqmEIQw   3   1     104000            0    143.7mb         72.1mb
green  open   project.logtest65.054d59bf-3b61-4f48-aff8-9c43bff7a1d9.2020.08.13        AqbSKE4STOiIYLGtIIep6w   3   1     100500            0    139.5mb         69.5mb

Only 2 out of the 100 indices actually got the expected 600000 messages; the others are missing a significant amount, even 15 minutes after the logging test stopped. Looking closer at the last one, logtest65:

[kni@e16-h18-b03-fc640 ~]$ oc get pods -A -o wide | grep logtest65
logtest65                                          centos-logtest-8nq8h                                         1/1     Running     0          40m     10.130.24.7       worker035   <none>           <none>

Checking the fluentd dir on node worker035:

[kni@e16-h18-b03-fc640 ~]$ oc debug node/worker035
Starting pod/worker035-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.222.48
If you don't see a command prompt, try pressing enter.
sh-4.2# 
sh-4.2# chroot /host
sh-4.4# ls -ltr /var/lib/fluentd/
clo_default_output_es/       retry_clo_default_output_es/ 
sh-4.4# ls -ltr /var/lib/fluentd/
clo_default_output_es/       retry_clo_default_output_es/ 
sh-4.4# ls -ltr /var/lib/fluentd/clo_default_output_es/
total 8
-rw-r--r--. 1 root root  221 Aug 13 07:05 buffer.b5acbcee6812adcfb761bf2c22534d4a2.log.meta
-rw-r--r--. 1 root root 1187 Aug 13 07:05 buffer.b5acbcee6812adcfb761bf2c22534d4a2.log

To me it looks like fluentd is not sending buffered messages to ES anymore. 

A must-gather can be found at http://file.str.redhat.com/mkarg/bz1833486/must-gather.tgz

Please let me know if you need any further information from the cluster.

Comment 31 Yaniv Joseph 2020-09-10 13:21:32 UTC
Hi Jeff,

As the case is still unresolved, can you re-open the BZ?

Thanks,
Yaniv

Comment 32 Yaniv Joseph 2020-09-10 14:15:10 UTC
Ignore comment #31; the issue is now tracked in a new BZ (see https://bugzilla.redhat.com/show_bug.cgi?id=1877818).

