Bug 1844639 - Dropped messages while logging at scale
Summary: Dropped messages while logging at scale
Keywords:
Status: CLOSED DUPLICATE of bug 1833486
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Vimal Kumar
QA Contact: Eric Matysek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-05 21:09 UTC by Eric Matysek
Modified: 2023-09-14 06:01 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-08 19:22:39 UTC
Target Upstream Version:
Embargoed:


Attachments
pod logs (168.39 KB, application/gzip)
2020-06-09 17:43 UTC, Eric Matysek

Description Eric Matysek 2020-06-05 21:09:50 UTC
Description of problem:
While testing max logging throughput for Elasticsearch, I am seeing missing messages at logging rates much lower than expected.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy logging
2. Create 100 projects, each logging at 100 msgs/sec (a sketch of one such generator is below)
3. Check total documents in app index
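
A minimal sketch of step 2, assuming a generic log-generator image (the image name and its flags below are illustrative placeholders, not the actual test harness used for this bug):

for i in $(seq 1 100); do
  oc new-project logtest${i}
  # hypothetical generator: emit 100 msgs/sec until 360,000 messages are written
  oc -n logtest${i} run logtest --image=quay.io/example/log-generator -- --rate 100 --count 360000
done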

Actual results:
All projects were set to log 360,000 messages.
4 projects are missing messages:
$ curl 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"term":{"kubernetes.namespace_name":"logtest10"}}}' | jq '.count'
321700
$ curl 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"term":{"kubernetes.namespace_name":"logtest22"}}}' | jq '.count'
231100
$ curl 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"term":{"kubernetes.namespace_name":"logtest28"}}}' | jq '.count'
275800
$ curl 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"term":{"kubernetes.namespace_name":"logtest35"}}}' | jq '.count'
229250

Total messages indexed from all projects (should be 36M):
$ curl 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"wildcard":{"kubernetes.namespace_name":{"value":"logtest*","boost":1,"rewrite":"constant_score"}}}}' | jq '.count'
35617850
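
(Side note: the per-namespace counts can also be pulled in a single request with a terms aggregation on the same field; a sketch, assuming kubernetes.namespace_name is mapped as a keyword as the term queries above suggest:)

$ curl 'https://localhost:9200/app*/_search?pretty&size=0' -H 'Content-Type: application/json' -d '{"aggs":{"per_namespace":{"terms":{"field":"kubernetes.namespace_name","size":100}}}}' | jq '.aggregations.per_namespace.buckets[] | {key, doc_count}'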


Expected results:
Should be 360k messages per project and 36M messages total (100 projects x 360k).

Additional info:
CPU usage is roughly 2-2.5 cores per Elasticsearch pod out of a reservation of 12 cores, so I don't think this is due to a resource limitation.
A maximum of 7 of these projects were assigned to a single node for this test, so we should be well within the capabilities of fluentd.
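
(A quick way to re-check per-pod CPU during a run, assuming cluster metrics are available; the label selector is an assumption:)

$ oc -n openshift-logging adm top pods -l component=elasticsearch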

Comment 1 Jeff Cantrill 2020-06-07 23:17:45 UTC
Targeting 4.6, Severity urgent.  @Eric, do you have a test where maybe we can write the logs to /dev/null or someplace other than Elastic to baseline the collector?  My expectation is the difference between 4.4 and 4.5 from a collection standpoint would be no different since it is the same config and fluent libraries.

Comment 2 Lukas Vlcek 2020-06-08 13:09:45 UTC
Could we get a logging dump from this cluster, please?
I am interested in the number of rejected bulk requests.
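
Something like the following should show the per-node rejection counts (a sketch; assumes the same access to the ES endpoint used in the description):

$ curl 'https://localhost:9200/_cat/thread_pool?v&h=node_name,name,host,completed,active,queue,rejected'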

Comment 3 Eric Matysek 2020-06-09 17:40:45 UTC
Here are the thread pool stats after a failed scale workload:
node_name                    name                host       completed active queue rejected
elasticsearch-cdm-dxnd91j5-1 analyze             10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 fetch_shard_started 10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 fetch_shard_store   10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 flush               10.130.2.6       897      0     0        0
elasticsearch-cdm-dxnd91j5-1 force_merge         10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 generic             10.130.2.6      9028      0     0        0
elasticsearch-cdm-dxnd91j5-1 get                 10.130.2.6      1779      0     0        0
elasticsearch-cdm-dxnd91j5-1 index               10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 listener            10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 management          10.130.2.6    197426      1     0        0
elasticsearch-cdm-dxnd91j5-1 refresh             10.130.2.6     25722      0     0        0
elasticsearch-cdm-dxnd91j5-1 search              10.130.2.6         7      0     0        0
elasticsearch-cdm-dxnd91j5-1 search_throttled    10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 snapshot            10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 warmer              10.130.2.6     65323      0     0        0
elasticsearch-cdm-dxnd91j5-1 write               10.130.2.6    346215      1     0        0
elasticsearch-cdm-dxnd91j5-2 analyze             10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 fetch_shard_started 10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 fetch_shard_store   10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 flush               10.131.2.5       896      0     0        0
elasticsearch-cdm-dxnd91j5-2 force_merge         10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 generic             10.131.2.5      2718      0     0        0
elasticsearch-cdm-dxnd91j5-2 get                 10.131.2.5      1772      0     0        0
elasticsearch-cdm-dxnd91j5-2 index               10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 listener            10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 management          10.131.2.5    195404      1     0        0
elasticsearch-cdm-dxnd91j5-2 refresh             10.131.2.5     19822      0     0        0
elasticsearch-cdm-dxnd91j5-2 search              10.131.2.5         7      0     0        0
elasticsearch-cdm-dxnd91j5-2 search_throttled    10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 snapshot            10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 warmer              10.131.2.5     65293      0     0        0
elasticsearch-cdm-dxnd91j5-2 write               10.131.2.5    346268      0     0        0
elasticsearch-cdm-dxnd91j5-3 analyze             10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 fetch_shard_started 10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 fetch_shard_store   10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 flush               10.128.4.7       891      0     0        0
elasticsearch-cdm-dxnd91j5-3 force_merge         10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 generic             10.128.4.7      8560      0     0        0
elasticsearch-cdm-dxnd91j5-3 get                 10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 index               10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 listener            10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 management          10.128.4.7    195572      1     0        0
elasticsearch-cdm-dxnd91j5-3 refresh             10.128.4.7     19820      0     0        0
elasticsearch-cdm-dxnd91j5-3 search              10.128.4.7         6      0     0        0
elasticsearch-cdm-dxnd91j5-3 search_throttled    10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 snapshot            10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 warmer              10.128.4.7     65039      0     0        0
elasticsearch-cdm-dxnd91j5-3 write               10.128.4.7    346155      0     0        0

Comment 4 Eric Matysek 2020-06-09 17:43:29 UTC
Created attachment 1696360 [details]
pod logs

Attached pod logs

Comment 5 Eric Matysek 2020-06-09 17:46:26 UTC
(In reply to Lukas Vlcek from comment #2)
> Could we get a logging dump from this cluster, please?
> I am interested in the number of rejected bulk requests.

I wouldn't expect rejected bulk requests, because I do not believe this is an issue with Elasticsearch not being able to keep up with incoming messages. During these tests the Elasticsearch pods are only using a fraction of the available CPU.

Comment 6 Eric Matysek 2020-06-09 18:35:53 UTC
(In reply to Jeff Cantrill from comment #1)
> Targeting 4.6, Severity urgent.  @Eric, do you have a test where maybe we
> can write the logs to /dev/null or someplace other than Elastic to baseline
> the collector.  My expectation is the difference between 4.4 and 4.5 from a
> collection standpoint would be no different since it is the same config and
> fluent libraries

I'm happy to look into that, but if we aren't writing to elasticsearch and are essentially throwing away the log messages, how will we know if we successfully shipped all the logs afterward?
I agree with your assessment; I don't think the issues we are seeing fall on fluentd, unless something underlying changed in the kubelet configs.

Comment 7 Jeff Cantrill 2020-06-10 13:23:42 UTC
(In reply to Eric Matysek from comment #6)
> (In reply to Jeff Cantrill from comment #1)
> > Targeting 4.6, Severity urgent.  @Eric, do you have a test where maybe we
> > can write the logs to /dev/null or someplace other than Elastic to baseline
> > the collector.  My expectation is the difference between 4.4 and 4.5 from a
> > collection standpoint would be no different since it is the same config and
> > fluent libraries
> 
> I'm happy to look into that, but if we aren't writing to elasticsearch and
> are essentially throwing away the log messages, how will we know if we
> successfully shipped all the logs afterward?
> I agree with your assessment, I don't think the issues we are seeing fall on
> fluentd, unless something underlying changed in kubelet configs.

It speaks to isolating the problem.  In 4.6 we will release LogForwarding, so we are interested in collector performance and characteristics assuming the happy path.  We have customers who may never even use Elasticsearch.

Dropped logs can be caused by:

* Lost logs in the fluent pipeline because we misconfigured something
* Back pressure from the store causing logs to be recycled back through and maybe dropped
* Fluent never reading the logs because the platform rotated the files it is reading out from under it (a spot check for this case is sketched below)

The question here is "where" are they lost?  It may be they were never read.
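
A rough way to spot-check the rotation case on a single node (the paths below are assumptions and may differ by version):

$ oc debug node/<node> -- chroot /host ls -l /var/log/pods/<namespace>_<pod>_<uid>/<container>/
$ oc debug node/<node> -- chroot /host grep -i containerLog /etc/kubernetes/kubelet.conf

If rotated files (0.log.<timestamp>) pile up faster than the collector reads them, messages can be lost before they are ever shipped.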

Comment 8 Periklis Tsirakidis 2020-07-06 07:20:56 UTC
Added the UpcomingSprint label, as the ongoing investigation is not conclusive yet.

Comment 9 Jeff Cantrill 2020-07-08 19:22:39 UTC

*** This bug has been marked as a duplicate of bug 1833486 ***

Comment 10 Jeff Cantrill 2020-07-08 19:23:11 UTC
Closed as a dup since this is related to the general performance of fluentd in recognizing log rotations.

Comment 11 Red Hat Bugzilla 2023-09-14 06:01:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.

