Bug 1872465 - Missing messages while testing logging at scale
Summary: Missing messages while testing logging at scale
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
Whiteboard: logging-core
Duplicates: 1877818
Depends On:
Blocks: 1847027
Reported: 2020-08-25 20:28 UTC by Eric Matysek
Modified: 2021-04-27 13:19 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-04-27 13:19:24 UTC
Target Upstream Version:

Attachments (Terms of Use)
Pod logs and other info (100.33 KB, application/gzip)
2020-08-25 20:28 UTC, Eric Matysek
Logging must-gather (1.71 MB, application/gzip)
2020-08-25 20:30 UTC, Eric Matysek

Description Eric Matysek 2020-08-25 20:28:26 UTC
Created attachment 1712589 [details]
Pod logs and other info

Description of problem:
Missing messages in Elasticsearch while testing logging throughput with a higher number of projects (100)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy cluster with 3 nodes for ES and 10 nodes for logtest projects 
2. Deploy 4.6 logging from respective release-4.6 branches
3. Run logging workload (https://github.com/RH-ematysek/workloads/tree/logtest_v45) with:
5k/s total throughput across 100 projects for 30 mins
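The arithmetic behind the failed assertion can be sketched as follows (a minimal illustration; the names mirror the test parameters above, but the script itself is not part of the workload):

```python
NUM_PROJECTS = 100        # projects spread across the logtest nodes
TOTAL_RATE = 5000         # msg/s aggregate throughput target
DURATION_S = 30 * 60      # 30-minute run

per_project_rate = TOTAL_RATE // NUM_PROJECTS  # 50 msg/s per project
expected_messages = TOTAL_RATE * DURATION_S    # what ES should index

print(per_project_rate, expected_messages)  # 50 9000000
```

Only 480050 of those 9M messages were indexed, which is what the assertion below reports.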

Actual results:
fatal: [jump]: FAILED! => {
    "assertion": "480050 == 9000000",
    "changed": false,
    "evaluated_to": false,
    "msg": "Assertion failed"

Expected results:
9M messages indexed

Additional info:
I am able to push 10k/s logs when split across just 10 projects with the same hardware setup: LABEL_ALL_NODES=True NUM_PROJECTS=10 NUM_LINES=1800000 RATE=60000
Testing 30 projects on a single node, to verify the issue is not in fluentd itself, also passes: LABEL_ALL_NODES=False NUM_PROJECTS=30 NUM_LINES=90000 RATE=3000
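Assuming RATE is per-project lines per minute and NUM_LINES is the per-project total (an inference from the numbers, not stated in the bug), both control runs work out to 30-minute tests at the quoted aggregate rates:

```python
# Assumed semantics (not confirmed in the bug): RATE = lines/min per
# project, NUM_LINES = total lines per project.
def run_profile(num_projects, num_lines, rate_per_min):
    per_project_s = rate_per_min / 60        # msg/s per project
    total_s = per_project_s * num_projects   # aggregate msg/s
    duration_min = num_lines / rate_per_min  # run length in minutes
    return per_project_s, total_s, duration_min

print(run_profile(10, 1_800_000, 60_000))  # (1000.0, 10000.0, 30.0)
print(run_profile(30, 90_000, 3_000))      # (50.0, 1500.0, 30.0)
```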

Comment 1 Eric Matysek 2020-08-25 20:30:14 UTC
Created attachment 1712591 [details]
Logging must-gather

Comment 2 Jeff Cantrill 2020-08-26 00:43:32 UTC
(In reply to Eric Matysek from comment #0)
> Created attachment 1712589 [details]

Curiously, the must-gather did not help me here; there is literally no useful information in the fluent logs. Also, we are missing the configmap for fluentd, which we need to correct.

> Expected results:
> 9M messages indexed
> Additional info:
> I am able to push 10k/s logs when split across just 10 projects with same
> hardware setup. LABEL_ALL_NODES=True NUM_PROJECTS=10 NUM_LINES=1800000
> RATE=60000

It would be interesting to understand exactly how many generator pods land on any given node. LABEL_ALL_NODES=True does not necessarily ensure that multiple pods won't land on the same node. Using the referenced scripts for the single-pod test, we know the best we can achieve is 2500 msg/s. If the summed rate of the pods running on any given node exceeds that value, we will likely miss log rotations and end up with message loss.
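A back-of-the-envelope way to check that condition (purely illustrative; the placement dict and pod rate here are hypothetical, and only the ~2500 msg/s ceiling comes from the comment above):

```python
from collections import defaultdict

NODE_CEILING = 2500  # msg/s a single collector can keep up with (per above)

def overloaded_nodes(placement, per_pod_rate):
    """placement maps pod -> node; return nodes whose summed rate exceeds the ceiling."""
    rate = defaultdict(float)
    for node in placement.values():
        rate[node] += per_pod_rate
    return {n: r for n, r in rate.items() if r > NODE_CEILING}

# e.g. 60 generator pods at 50 msg/s on one node would overshoot: 3000 > 2500
placement = {f"logtest-{i}": "node-1" for i in range(60)}
print(overloaded_nodes(placement, 50))  # {'node-1': 3000.0}
```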

Comment 3 Eric Matysek 2020-08-26 17:54:56 UTC
I added a step in my workload to output node placement to a file and checked it after the test; in the test referenced above I had at most 12 projects (600 msg/s) assigned to one node. In other tests I saw up to 20 projects assigned to a single node, but that is still within the limits of my single-node testing.
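A placement check like the one described could be as simple as the following (a sketch; the real file format is not shown in the bug, so this assumes one "<pod> <node>" pair per line, and the 50 msg/s figure is the per-project rate from the 5k/s-over-100-projects test):

```python
from collections import Counter

PER_PROJECT_RATE = 50  # msg/s per project at 5k/s across 100 projects

def busiest_node(placement_lines):
    """Return (node, project_count) for the most heavily loaded node."""
    counts = Counter(line.split()[1] for line in placement_lines if line.strip())
    return counts.most_common(1)[0]

lines = ["logtest-1 node-a", "logtest-2 node-a", "logtest-3 node-b"]
node, n = busiest_node(lines)
print(node, n, n * PER_PROJECT_RATE)  # node-a 2 100
```

At 12 projects per node that gives 600 msg/s, well under the 2500 msg/s single-node ceiling.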

Comment 4 Jeff Cantrill 2020-09-10 15:52:57 UTC
*** Bug 1877818 has been marked as a duplicate of this bug. ***

Comment 5 Jeff Cantrill 2020-09-11 20:18:41 UTC
Moving to UpcomingRelease

Comment 7 W. Trevor King 2020-09-28 19:43:56 UTC
It's possible that this issue causes the warning-level FluentdQueueLengthBurst alert to fire.

Comment 9 Jeff Cantrill 2020-10-02 15:24:09 UTC
Marking UpcomingSprint, as this will not be merged or addressed by EOD.

Comment 10 Jeff Cantrill 2020-10-23 15:20:12 UTC
Setting UpcomingSprint, as we are unable to resolve this before EOD.

Comment 13 Jeff Cantrill 2021-03-02 17:03:44 UTC
Closing in favor of the larger solution captured in https://issues.redhat.com/browse/LOG-1179, to be fixed in a future release.

Comment 14 Mike Fiedler 2021-03-04 15:30:06 UTC
Re-opening this temporarily

@jcantril - Looks like the JIRA ticket linked in comment 13 has been deleted. Can you add whatever tickets are tracking this issue to this bz? Thanks.

Comment 15 Jeff Cantrill 2021-04-27 13:19:24 UTC
The ticket still exists; it should be open for Red Hat internal users.
