Bug 1872465

Summary: Missing messages while testing logging at scale
Product: OpenShift Container Platform
Reporter: Eric Matysek <ematysek>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED DEFERRED
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: high
Version: 4.6
CC: aos-bugs, dblack, jcantril, mfuruta, mifiedle, nelluri, steven.barre, wking
Target Milestone: ---
Keywords: Regression, Reopened
Target Release: 4.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: logging-core
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-04-27 13:19:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1847027    
Attachments:
  Pod logs and other info (flags: none)
  Logging must-gather (flags: none)

Description Eric Matysek 2020-08-25 20:28:26 UTC
Created attachment 1712589 [details]
Pod logs and other info

Description of problem:
Missing messages in Elasticsearch while testing logging throughput with a higher number of projects (100)

Version-Release number of selected component (if applicable):
4.6

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster with 3 nodes for ES and 10 nodes for logtest projects.
2. Deploy 4.6 logging from the respective release-4.6 branches.
3. Run the logging workload (https://github.com/RH-ematysek/workloads/tree/logtest_v45) with:
   LABEL_ALL_NODES=True NUM_PROJECTS=100 NUM_LINES=90000 RATE=3000
   i.e. 5k/s total throughput across 100 projects for 30 mins (see the sketch below).
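
A minimal sketch of how those variables might be exported before launching the workload; only the variable values come from this report, while the inventory file and playbook path are hypothetical placeholders:

  # Hypothetical invocation sketch -- only the variable values come from this
  # report; the inventory file and playbook path are assumptions.
  export LABEL_ALL_NODES=True
  export NUM_PROJECTS=100   # logtest projects spread across the labeled nodes
  export NUM_LINES=90000    # lines emitted per project
  export RATE=3000          # lines per minute per project (~50 msg/s)
  ansible-playbook -i hosts workloads/logtest.yml   # path is an assumption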

Actual results:
fatal: [jump]: FAILED! => {
    "assertion": "480050 == 9000000",
    "changed": false,
    "evaluated_to": false,
    "msg": "Assertion failed"
}


Expected results:
9M messages indexed
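
As a hedged aside, one way to check how many application documents were actually indexed is to query Elasticsearch's _count API from inside an ES pod; the namespace, pod label, certificate paths, and the app-* index pattern below are assumptions based on a typical OpenShift Logging 4.6 deployment, not taken from this report.

  # Count indexed application documents (expected ~9,000,000 here).
  # Pod label, cert paths, and the "app-*" index pattern are assumptions
  # for a typical OpenShift Logging 4.6 install.
  ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch \
             -o jsonpath='{.items[0].metadata.name}')
  oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
         --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         "https://localhost:9200/app-*/_count?pretty"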

Additional info:
I am able to push 10k/s logs when split across just 10 projects with the same hardware setup: LABEL_ALL_NODES=True NUM_PROJECTS=10 NUM_LINES=1800000 RATE=60000
Testing 30 projects on a single node, to verify that fluentd itself is not the issue, also passes: LABEL_ALL_NODES=False NUM_PROJECTS=30 NUM_LINES=90000 RATE=3000

Comment 1 Eric Matysek 2020-08-25 20:30:14 UTC
Created attachment 1712591 [details]
Logging must-gather

Comment 2 Jeff Cantrill 2020-08-26 00:43:32 UTC
(In reply to Eric Matysek from comment #0)
> Created attachment 1712589 [details]

Curious, as the must-gather did not help me here; there is literally no useful information in the fluent logs.  Also, we are missing the configmap for fluentd, which we need to correct.

> 
> Expected results:
> 9M messages indexed
> 
> Additional info:
> I am able to push 10k/s logs when split across just 10 projects with same
> hardware setup. LABEL_ALL_NODES=True NUM_PROJECTS=10 NUM_LINES=1800000
> RATE=60000

It would be interesting to understand exactly how many generator pods land on any given node.  LABEL_ALL_NODES=True does not necessarily prevent multiple pods from landing on the same node.  Using the referenced scripts for the single-pod test, we know the best we can achieve is 2500 msg/s per node.  If the combined rate of the pods running on any given node exceeds that value, then we will likely miss log rotations and end up with message loss.

Comment 3 Eric Matysek 2020-08-26 17:54:56 UTC
I added a step in my workload to output node placement to a file and checked it after the test. In the test referenced above I had at most 12 projects (600 msg/s) assigned to one node. In other tests I saw up to 20 projects assigned to a single node, but that is still within the limits of my single-node testing.
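
For reference, per-node placement can be tallied after a run with something like the sketch below; the "logtest" namespace prefix is an assumption about how the generator projects are named. At RATE=3000 lines/min per project (~50 msg/s), 12 projects on one node works out to the ~600 msg/s cited above, well under the ~2500 msg/s per-node ceiling from comment 2.

  # Count generator pods per node; the "logtest" namespace prefix is an
  # assumption about how the projects are named by this workload.
  oc get pods --all-namespaces --no-headers \
      -o custom-columns=NS:.metadata.namespace,NODE:.spec.nodeName \
    | grep '^logtest' \
    | awk '{print $2}' | sort | uniq -c | sort -rn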

Comment 4 Jeff Cantrill 2020-09-10 15:52:57 UTC
*** Bug 1877818 has been marked as a duplicate of this bug. ***

Comment 5 Jeff Cantrill 2020-09-11 20:18:41 UTC
Moving to UpcomingRelease

Comment 7 W. Trevor King 2020-09-28 19:43:56 UTC
It's possible that this issue causes the warning-level FluentdQueueLengthBurst alert to fire.

Comment 9 Jeff Cantrill 2020-10-02 15:24:09 UTC
Marking UpcomingSprint, as this will not be merged or addressed by EOD.

Comment 10 Jeff Cantrill 2020-10-23 15:20:12 UTC
Setting UpcomingSprint, as we are unable to resolve this before EOD.

Comment 13 Jeff Cantrill 2021-03-02 17:03:44 UTC
Closing in favor of the larger solution captured at https://issues.redhat.com/browse/LOG-1179, to be fixed in the future.

Comment 14 Mike Fiedler 2021-03-04 15:30:06 UTC
Re-opening this temporarily

@jcantril - Looks like the JIRA ticket linked in comment 13 has been deleted.   Can you add whatever tickets are tracking this issue to this bz?  Thanks.

Comment 15 Jeff Cantrill 2021-04-27 13:19:24 UTC
That ticket still exists.  It should be visible to Red Hat internal users.