Bug 1872465
Summary: Missing messages while testing logging at scale

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Logging |
| Status | CLOSED DEFERRED |
| Severity | high |
| Priority | high |
| Version | 4.6 |
| Target Milestone | --- |
| Target Release | 4.7.z |
| Reporter | Eric Matysek <ematysek> |
| Assignee | Jeff Cantrill <jcantril> |
| QA Contact | Anping Li <anli> |
| CC | aos-bugs, dblack, jcantril, mfuruta, mifiedle, nelluri, steven.barre, wking |
| Keywords | Regression, Reopened |
| Whiteboard | logging-core |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Last Closed | 2021-04-27 13:19:24 UTC |
| Bug Blocks | 1847027 |
Description (Eric Matysek, 2020-08-25 20:28:26 UTC)
Created attachment 1712591 [details]
Logging must-gather
(In reply to Eric Matysek from comment #0)
> Created attachment 1712589 [details]

Curiously, the must-gather did not help me here; there is literally no useful information in the fluent logs. We are also missing the ConfigMap for fluentd, which we need to correct.

> Expected results:
> 9M messages indexed
>
> Additional info:
> I am able to push 10k/s logs when split across just 10 projects with the same
> hardware setup. LABEL_ALL_NODES=True NUM_PROJECTS=10 NUM_LINES=1800000 RATE=60000

It would be interesting to understand exactly how many generator pods land on any given node. LABEL_ALL_NODES=True does not necessarily prevent multiple pods from landing on the same node. Using the referenced scripts for the single-pod test, we know the best we can achieve is 2,500 msg/s. If the summed rate of the pods running on any given node exceeds that value, we will likely miss log rotations and end up with message loss.

I added a step to my workload that writes node placement to a file and checked it after the test. In the test referenced above, I had at most 12 projects (600 msg/s) assigned to one node. In other tests I saw up to 20 projects assigned to a single node, but that is still within the limits established by my single-node testing.

*** Bug 1877818 has been marked as a duplicate of this bug. ***

Moving to UpcomingRelease.

It is possible that this issue causes the warning-level FluentdQueueLengthBurst alert to fire.

Marking UpcomingSprint, as this will not be merged or addressed by EOD.

Setting UpcomingSprint, as we are unable to resolve this before EOD.

Closing in favor of the larger solution captured at https://issues.redhat.com/browse/LOG-1179, to be fixed in the future.

Re-opening this temporarily.

@jcantril - It looks like the JIRA ticket linked in comment 13 has been deleted. Can you add whatever tickets are tracking this issue to this bz? Thanks.

The ticket still exists; it should be open for Red Hat internal users.
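The per-node capacity reasoning in the comments above can be sketched as a quick check. This is a hypothetical helper, not part of the test scripts; the 2,500 msg/s single-node ceiling comes from the single-pod test cited above, and the per-pod rates below are illustrative:

```python
# Assumption from the single-pod test above: one collector can keep up with
# at most ~2,500 messages/second per node before log rotations are missed.
SINGLE_NODE_CEILING = 2500  # msg/s

def node_at_risk(pod_rates_msg_per_s):
    """Return True if the summed rate of all generator pods scheduled on
    one node exceeds the measured single-node collection ceiling."""
    return sum(pod_rates_msg_per_s) > SINGLE_NODE_CEILING

# 12 generator pods at 50 msg/s each (600 msg/s total): within the ceiling.
print(node_at_risk([50] * 12))   # False

# 20 pods at 150 msg/s each (3,000 msg/s total): would exceed it and
# risk missed rotations / lost messages.
print(node_at_risk([150] * 20))  # True
```

This is the same arithmetic the reporter applied manually after dumping node placement to a file: sum the per-project rates landing on each node and compare against the single-node limit.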