Bug 1872465 - Missing messages while testing logging at scale
Summary: Missing messages while testing logging at scale
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
Whiteboard: logging-core
Duplicates: 1877818
Depends On:
Blocks: 1847027
Reported: 2020-08-25 20:28 UTC by Eric Matysek
Modified: 2021-04-27 13:19 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-04-27 13:19:24 UTC
Target Upstream Version:

Attachments (Terms of Use)
Pod logs and other info (100.33 KB, application/gzip)
2020-08-25 20:28 UTC, Eric Matysek
Logging must-gather (1.71 MB, application/gzip)
2020-08-25 20:30 UTC, Eric Matysek

Description Eric Matysek 2020-08-25 20:28:26 UTC
Created attachment 1712589 [details]
Pod logs and other info

Description of problem:
Missing messages in Elasticsearch while testing logging throughput with a higher number of projects (100)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy cluster with 3 nodes for ES and 10 nodes for logtest projects 
2. Deploy 4.6 logging from respective release-4.6 branches
3. Run logging workload (https://github.com/RH-ematysek/workloads/tree/logtest_v45) with:
5k/s total throughput across 100 projects for 30 mins
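The arithmetic behind the failed assertion can be sketched as follows (a minimal illustration; the names mirror the test parameters above, but the script itself is not part of the workload):

```python
NUM_PROJECTS = 100        # projects spread across the logtest nodes
TOTAL_RATE = 5000         # msg/s aggregate throughput target
DURATION_S = 30 * 60      # 30-minute run

per_project_rate = TOTAL_RATE // NUM_PROJECTS  # 50 msg/s per project
expected_messages = TOTAL_RATE * DURATION_S    # what ES should index

print(per_project_rate, expected_messages)  # 50 9000000
```

Only 480050 of those 9M messages were indexed, which is what the assertion below reports.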

Actual results:
fatal: [jump]: FAILED! => {
    "assertion": "480050 == 9000000",
    "changed": false,
    "evaluated_to": false,
    "msg": "Assertion failed"

Expected results:
9M messages indexed

Additional info:
I am able to push 10k/s logs when split across just 10 projects with the same hardware setup: LABEL_ALL_NODES=True NUM_PROJECTS=10 NUM_LINES=1800000 RATE=60000
Testing 30 projects on a single node, to verify the issue is not in fluentd itself, also passes: LABEL_ALL_NODES=False NUM_PROJECTS=30 NUM_LINES=90000 RATE=3000
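Assuming RATE is per-project lines per minute and NUM_LINES is the per-project total (an inference from the numbers, not stated in the bug), both control runs work out to 30-minute tests at the quoted aggregate rates:

```python
# Assumed semantics (not confirmed in the bug): RATE = lines/min per
# project, NUM_LINES = total lines per project.
def run_profile(num_projects, num_lines, rate_per_min):
    per_project_s = rate_per_min / 60        # msg/s per project
    total_s = per_project_s * num_projects   # aggregate msg/s
    duration_min = num_lines / rate_per_min  # run length in minutes
    return per_project_s, total_s, duration_min

print(run_profile(10, 1_800_000, 60_000))  # (1000.0, 10000.0, 30.0)
print(run_profile(30, 90_000, 3_000))      # (50.0, 1500.0, 30.0)
```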

Comment 1 Eric Matysek 2020-08-25 20:30:14 UTC
Created attachment 1712591 [details]
Logging must-gather

Comment 2 Jeff Cantrill 2020-08-26 00:43:32 UTC
(In reply to Eric Matysek from comment #0)
> Created attachment 1712589 [details]

Curiously, the must-gather did not help me here; there is literally no useful information in the fluent logs. Also, we are missing the configmap for fluentd, which we need to correct.

> Expected results:
> 9M messages indexed
> Additional info:
> I am able to push 10k/s logs when split across just 10 projects with same
> hardware setup. LABEL_ALL_NODES=True NUM_PROJECTS=10 NUM_LINES=1800000
> RATE=60000

It would be interesting to understand exactly how many generator pods land on any given node. LABEL_ALL_NODES=True does not necessarily ensure that multiple pods won't land on the same node. Using the referenced scripts for the single-pod test, we know the best we can achieve is 2500 msg/s. If the summed rate of the pods running on any given node exceeds that value, we will likely miss log rotations and end up with message loss.
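A back-of-the-envelope way to check that condition (purely illustrative; the placement dict and pod rate here are hypothetical, and only the ~2500 msg/s ceiling comes from the comment above):

```python
from collections import defaultdict

NODE_CEILING = 2500  # msg/s a single collector can keep up with (per above)

def overloaded_nodes(placement, per_pod_rate):
    """placement maps pod -> node; return nodes whose summed rate exceeds the ceiling."""
    rate = defaultdict(float)
    for node in placement.values():
        rate[node] += per_pod_rate
    return {n: r for n, r in rate.items() if r > NODE_CEILING}

# e.g. 60 generator pods at 50 msg/s on one node would overshoot: 3000 > 2500
placement = {f"logtest-{i}": "node-1" for i in range(60)}
print(overloaded_nodes(placement, 50))  # {'node-1': 3000.0}
```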

Comment 3 Eric Matysek 2020-08-26 17:54:56 UTC
I added a step in my workload to output node placement to a file and checked it after the test; in the test referenced above I had at most 12 projects (600 msg/s) assigned to one node. In other tests I saw up to 20 projects assigned to a single node, but that is still within the limits of my single-node testing.
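A placement check like the one described could be as simple as the following (a sketch; the real file format is not shown in the bug, so this assumes one "<pod> <node>" pair per line, and the 50 msg/s figure is the per-project rate from the 5k/s-over-100-projects test):

```python
from collections import Counter

PER_PROJECT_RATE = 50  # msg/s per project at 5k/s across 100 projects

def busiest_node(placement_lines):
    """Return (node, project_count) for the most heavily loaded node."""
    counts = Counter(line.split()[1] for line in placement_lines if line.strip())
    return counts.most_common(1)[0]

lines = ["logtest-1 node-a", "logtest-2 node-a", "logtest-3 node-b"]
node, n = busiest_node(lines)
print(node, n, n * PER_PROJECT_RATE)  # node-a 2 100
```

At 12 projects per node that gives 600 msg/s, well under the 2500 msg/s single-node ceiling.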

Comment 4 Jeff Cantrill 2020-09-10 15:52:57 UTC
*** Bug 1877818 has been marked as a duplicate of this bug. ***

Comment 5 Jeff Cantrill 2020-09-11 20:18:41 UTC
Moving to UpcomingRelease

Comment 7 W. Trevor King 2020-09-28 19:43:56 UTC
It's possible that this issue causes the warning-level FluentdQueueLengthBurst alert to fire.

Comment 9 Jeff Cantrill 2020-10-02 15:24:09 UTC
Marking UpcomingSprint, as this will not be merged or addressed by EOD.

Comment 10 Jeff Cantrill 2020-10-23 15:20:12 UTC
Setting UpcomingSprint, as we are unable to resolve this before EOD.

Comment 13 Jeff Cantrill 2021-03-02 17:03:44 UTC
Closing in favor of the larger solution captured in https://issues.redhat.com/browse/LOG-1179, to be fixed in a future release.

Comment 14 Mike Fiedler 2021-03-04 15:30:06 UTC
Re-opening this temporarily

@jcantril - Looks like the JIRA ticket linked in comment 13 has been deleted. Can you add whatever tickets are tracking this issue to this bz? Thanks.

Comment 15 Jeff Cantrill 2021-04-27 13:19:24 UTC
The ticket still exists; it should be open for Red Hat internal users.
