Description of problem:
While testing max logging throughput for Elasticsearch I am seeing missing messages at logging rates that are much lower than expected.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy logging
2. Create 100 projects logging at 100 msgs/sec
3. Check total documents in the app index

Actual results:
All projects were set to log 360,000 messages. 4 projects are missing messages:

$ 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"term":{"kubernetes.namespace_name":"logtest10"}}}' | jq '.count'
321700

$ 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"term":{"kubernetes.namespace_name":"logtest22"}}}' | jq '.count'
231100

$ 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"term":{"kubernetes.namespace_name":"logtest28"}}}' | jq '.count'
275800

$ 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"term":{"kubernetes.namespace_name":"logtest35"}}}' | jq '.count'
229250

Total messages indexed from all projects (should be 36M):

$ 'https://localhost:9200/app*/_count?pretty' -H 'Content-Type: application/json' -d '{"query":{"wildcard":{"kubernetes.namespace_name":{"value":"logtest*","boost":1,"rewrite":"constant_score"}}}}' | jq '.count'
35617850

Expected results:
Should be 360k messages per project and 36M messages total (100 projects x 360,000).

Additional info:
CPU usage is roughly 2-2.5 cores per Elasticsearch pod out of a reservation of 12 cores, so I don't think this is due to a limitation of resources. A maximum of 7 of these projects were assigned to a single node for this test, so we should be well within the capabilities of fluentd.
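For anyone reproducing this, here is a minimal sketch of how the per-project counts above could be gathered in one pass. It assumes the same endpoint and query shape as the commands in the description; the description omits the client invocation and auth flags, so CURL below is a placeholder, and EXPECTED and the namespace range are assumptions based on the test setup described here.

#!/bin/bash
# Sketch: count indexed documents per logtest namespace and flag shortfalls.
CURL="curl -s"                  # assumption: add whatever certs/keys the cluster needs
ES="https://localhost:9200"
EXPECTED=360000                 # messages each project was set to log

for i in $(seq 1 100); do
  ns="logtest${i}"
  count=$($CURL "${ES}/app*/_count" \
    -H 'Content-Type: application/json' \
    -d "{\"query\":{\"term\":{\"kubernetes.namespace_name\":\"${ns}\"}}}" \
    | jq '.count')
  if [ "${count}" -lt "${EXPECTED}" ]; then
    echo "${ns}: ${count} (missing $((EXPECTED - count)))"
  fi
done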
Targeting 4.6, Severity urgent. @Eric, do you have a test where maybe we can write the logs to /dev/null or someplace other than Elasticsearch to baseline the collector? My expectation is that the difference between 4.4 and 4.5 from a collection standpoint would be no different since it is the same config and fluentd libraries.
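For the /dev/null style baseline, one option is plain fluentd's built-in null output; a minimal sketch, assuming we can temporarily override the generated collector config for the test (the catch-all match pattern is an assumption, the real config routes records through several pipelines):

# Discard everything after collection/parsing so we measure the collector alone.
<match **>
  @type null
</match>

Record counts would then have to come from the collector side (e.g. fluentd's monitor_agent or prometheus plugin counters) rather than from Elasticsearch.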
Could we get a logging dump from this cluster please? I am interested in the number of rejected bulk requests.
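If a full dump is hard to grab, the rejected counts alone can be read from the _cat API; a minimal sketch, assuming the same local endpoint as the queries in the description (client cert/auth options omitted here as they are there; on ES 6.x bulk indexing is handled by the 'write' thread pool):

$ curl -s 'https://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,host,completed,active,queue,rejected'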
Here are the threads after a failed scale workload:

node_name                    name                host       completed active queue rejected
elasticsearch-cdm-dxnd91j5-1 analyze             10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 fetch_shard_started 10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 fetch_shard_store   10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 flush               10.130.2.6       897      0     0        0
elasticsearch-cdm-dxnd91j5-1 force_merge         10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 generic             10.130.2.6      9028      0     0        0
elasticsearch-cdm-dxnd91j5-1 get                 10.130.2.6      1779      0     0        0
elasticsearch-cdm-dxnd91j5-1 index               10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 listener            10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 management          10.130.2.6    197426      1     0        0
elasticsearch-cdm-dxnd91j5-1 refresh             10.130.2.6     25722      0     0        0
elasticsearch-cdm-dxnd91j5-1 search              10.130.2.6         7      0     0        0
elasticsearch-cdm-dxnd91j5-1 search_throttled    10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 snapshot            10.130.2.6         0      0     0        0
elasticsearch-cdm-dxnd91j5-1 warmer              10.130.2.6     65323      0     0        0
elasticsearch-cdm-dxnd91j5-1 write               10.130.2.6    346215      1     0        0
elasticsearch-cdm-dxnd91j5-2 analyze             10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 fetch_shard_started 10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 fetch_shard_store   10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 flush               10.131.2.5       896      0     0        0
elasticsearch-cdm-dxnd91j5-2 force_merge         10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 generic             10.131.2.5      2718      0     0        0
elasticsearch-cdm-dxnd91j5-2 get                 10.131.2.5      1772      0     0        0
elasticsearch-cdm-dxnd91j5-2 index               10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 listener            10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 management          10.131.2.5    195404      1     0        0
elasticsearch-cdm-dxnd91j5-2 refresh             10.131.2.5     19822      0     0        0
elasticsearch-cdm-dxnd91j5-2 search              10.131.2.5         7      0     0        0
elasticsearch-cdm-dxnd91j5-2 search_throttled    10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 snapshot            10.131.2.5         0      0     0        0
elasticsearch-cdm-dxnd91j5-2 warmer              10.131.2.5     65293      0     0        0
elasticsearch-cdm-dxnd91j5-2 write               10.131.2.5    346268      0     0        0
elasticsearch-cdm-dxnd91j5-3 analyze             10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 fetch_shard_started 10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 fetch_shard_store   10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 flush               10.128.4.7       891      0     0        0
elasticsearch-cdm-dxnd91j5-3 force_merge         10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 generic             10.128.4.7      8560      0     0        0
elasticsearch-cdm-dxnd91j5-3 get                 10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 index               10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 listener            10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 management          10.128.4.7    195572      1     0        0
elasticsearch-cdm-dxnd91j5-3 refresh             10.128.4.7     19820      0     0        0
elasticsearch-cdm-dxnd91j5-3 search              10.128.4.7         6      0     0        0
elasticsearch-cdm-dxnd91j5-3 search_throttled    10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 snapshot            10.128.4.7         0      0     0        0
elasticsearch-cdm-dxnd91j5-3 warmer              10.128.4.7     65039      0     0        0
elasticsearch-cdm-dxnd91j5-3 write               10.128.4.7    346155      0     0        0
Created attachment 1696360 [details]
pod logs

Attached pod logs.
(In reply to Lukas Vlcek from comment #2)
> Could we get a logging dump from this cluster please?
> I am interested in the number of rejected bulk requests.

I wouldn't expect rejected bulk requests because I do not believe this is an issue with Elasticsearch not being able to keep up with incoming messages. During these tests the Elasticsearch pods are only using a fraction of the available CPU.
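For reference, the CPU observation can be spot-checked in-cluster with the metrics API; a minimal sketch, assuming the default openshift-logging namespace:

$ oc adm top pods -n openshift-logging | grep elasticsearch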
(In reply to Jeff Cantrill from comment #1)
> Targeting 4.6, Severity urgent. @Eric, do you have a test where maybe we
> can write the logs to /dev/null or someplace other than Elasticsearch to
> baseline the collector? My expectation is that the difference between 4.4
> and 4.5 from a collection standpoint would be no different since it is the
> same config and fluentd libraries.

I'm happy to look into that, but if we aren't writing to Elasticsearch and are essentially throwing away the log messages, how will we know if we successfully shipped all the logs afterward?

I agree with your assessment; I don't think the issues we are seeing fall on fluentd, unless something underlying changed in the kubelet configs.
(In reply to Eric Matysek from comment #6)
> I'm happy to look into that, but if we aren't writing to Elasticsearch and
> are essentially throwing away the log messages, how will we know if we
> successfully shipped all the logs afterward?
> I agree with your assessment; I don't think the issues we are seeing fall
> on fluentd, unless something underlying changed in the kubelet configs.

It speaks to isolating the problem. In 4.6 we will release LogForwarding, so we are interested in collector performance and characteristics assuming the happy path. We have customers who may never even use Elasticsearch.

Dropped logs can be caused by:

* Lost logs in the fluent pipeline because we misconfigured something
* Back pressure from the store causing logs to be recycled back through and maybe dropped
* Fluent never reading the logs because the platform rotated the files it's reading out from under it

The question here is "where" are they lost? It may be they were never read. A sketch of how to check the "never read" case is below.
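On the "never read" possibility, a minimal sketch for comparing what the collector has recorded positions for against the container log files currently on a node. The pos file path and the label selector below are assumptions (they vary by release), not taken from this cluster, and this only sees files still present, not ones already rotated away:

# Pick one collector pod (label selector is an assumption; adjust as needed).
POD=$(oc get pods -n openshift-logging -l component=fluentd -o name | head -1)

# File names fluentd has recorded positions for (pos file path is an assumption).
oc exec -n openshift-logging ${POD} -- cat /var/log/es-containers.log.pos \
  | awk '{print $1}' | xargs -n1 basename | sort > tracked.txt

# Container log files currently present on the node.
oc exec -n openshift-logging ${POD} -- ls /var/log/containers/ | sort > ondisk.txt

# Files on disk that fluentd has no position for, i.e. possibly never read.
comm -13 tracked.txt ondisk.txt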
Adding the UpcomingSprint label, as the ongoing investigation is not yet conclusive.
*** This bug has been marked as a duplicate of bug 1833486 ***
Closed as a duplicate since this is related to the general performance of fluentd being able to recognize log rotations.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days