Bug 1872465

Summary: Missing messages while testing logging at scale
Product: OpenShift Container Platform
Reporter: Eric Matysek <ematysek>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED DEFERRED
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: high
Version: 4.6
CC: aos-bugs, dblack, jcantril, mfuruta, mifiedle, nelluri, steven.barre, wking
Target Milestone: ---
Keywords: Regression, Reopened
Target Release: 4.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: logging-core
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-04-27 13:19:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1847027    
Attachments:
  Pod logs and other info (flags: none)
  Logging must-gather (flags: none)

Description Eric Matysek 2020-08-25 20:28:26 UTC
Created attachment 1712589 [details]
Pod logs and other info

Description of problem:
Missing messages in Elasticsearch while testing logging throughput with a higher number of projects (100)

Version-Release number of selected component (if applicable):
4.6

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster with 3 nodes for ES and 10 nodes for logtest projects.
2. Deploy 4.6 logging from the respective release-4.6 branches.
3. Run the logging workload (https://github.com/RH-ematysek/workloads/tree/logtest_v45) with:
   LABEL_ALL_NODES=True NUM_PROJECTS=100 NUM_LINES=90000 RATE=3000
   i.e. 5k/s total throughput across 100 projects for 30 mins (see the sketch below).
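
A minimal sketch of how those variables might be exported before launching the workload; only the variable values come from this report, while the inventory file and playbook path are hypothetical placeholders:

  # Hypothetical invocation sketch -- only the variable values come from this
  # report; the inventory file and playbook path are assumptions.
  export LABEL_ALL_NODES=True
  export NUM_PROJECTS=100   # logtest projects spread across the labeled nodes
  export NUM_LINES=90000    # lines emitted per project
  export RATE=3000          # lines per minute per project (~50 msg/s)
  ansible-playbook -i hosts workloads/logtest.yml   # path is an assumption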

Actual results:
fatal: [jump]: FAILED! => {
    "assertion": "480050 == 9000000",
    "changed": false,
    "evaluated_to": false,
    "msg": "Assertion failed"
}


Expected results:
9M messages indexed
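
As a hedged aside, one way to check how many application documents were actually indexed is to query Elasticsearch's _count API from inside an ES pod; the namespace, pod label, certificate paths, and the app-* index pattern below are assumptions based on a typical OpenShift Logging 4.6 deployment, not taken from this report.

  # Count indexed application documents (expected ~9,000,000 here).
  # Pod label, cert paths, and the "app-*" index pattern are assumptions
  # for a typical OpenShift Logging 4.6 install.
  ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch \
             -o jsonpath='{.items[0].metadata.name}')
  oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
         --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         "https://localhost:9200/app-*/_count?pretty"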

Additional info:
I am able to push 10k/s logs when split across just 10 projects with the same hardware setup: LABEL_ALL_NODES=True NUM_PROJECTS=10 NUM_LINES=1800000 RATE=60000
Testing 30 projects on a single node, to verify that fluentd itself is not the issue, also passes: LABEL_ALL_NODES=False NUM_PROJECTS=30 NUM_LINES=90000 RATE=3000

Comment 1 Eric Matysek 2020-08-25 20:30:14 UTC
Created attachment 1712591 [details]
Logging must-gather

Comment 2 Jeff Cantrill 2020-08-26 00:43:32 UTC
(In reply to Eric Matysek from comment #0)
> Created attachment 1712589 [details]

Curious, as the must-gather did not help me here; there is literally no useful information in the fluent logs.  Also, we are missing the configmap for fluentd, which we need to correct.

> 
> Expected results:
> 9M messages indexed
> 
> Additional info:
> I am able to push 10k/s logs when split across just 10 projects with same
> hardware setup. LABEL_ALL_NODES=True NUM_PROJECTS=10 NUM_LINES=1800000
> RATE=60000

It would be interesting to understand exactly how many generator pods land on any given node.  LABEL_ALL_NODES=True does not necessarily prevent multiple pods from landing on the same node.  Using the referenced scripts for the single-pod test, we know the best we can achieve is 2500 msg/s per node.  If the combined rate of the pods running on any given node exceeds that value, then we will likely miss log rotations and end up with message loss.

Comment 3 Eric Matysek 2020-08-26 17:54:56 UTC
I added a step in my workload to output node placement to a file and checked it after the test. In the test referenced above I had at most 12 projects (600 msg/s) assigned to one node. In other tests I saw up to 20 projects assigned to a single node, but that is still within the limits of my single-node testing.
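
For reference, per-node placement can be tallied after a run with something like the sketch below; the "logtest" namespace prefix is an assumption about how the generator projects are named. At RATE=3000 lines/min per project (~50 msg/s), 12 projects on one node works out to the ~600 msg/s cited above, well under the ~2500 msg/s per-node ceiling from comment 2.

  # Count generator pods per node; the "logtest" namespace prefix is an
  # assumption about how the projects are named by this workload.
  oc get pods --all-namespaces --no-headers \
      -o custom-columns=NS:.metadata.namespace,NODE:.spec.nodeName \
    | grep '^logtest' \
    | awk '{print $2}' | sort | uniq -c | sort -rn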

Comment 4 Jeff Cantrill 2020-09-10 15:52:57 UTC
*** Bug 1877818 has been marked as a duplicate of this bug. ***

Comment 5 Jeff Cantrill 2020-09-11 20:18:41 UTC
Moving to UpcomingRelease

Comment 7 W. Trevor King 2020-09-28 19:43:56 UTC
It's possible that this issue causes the warning-level FluentdQueueLengthBurst alert to fire.

Comment 9 Jeff Cantrill 2020-10-02 15:24:09 UTC
Marking UpcomingSprint, as this will not be merged or addressed by EOD.

Comment 10 Jeff Cantrill 2020-10-23 15:20:12 UTC
Setting UpcomingSprint, as we are unable to resolve this before EOD.

Comment 13 Jeff Cantrill 2021-03-02 17:03:44 UTC
Closing in favor of the larger solution captured at https://issues.redhat.com/browse/LOG-1179, to be fixed in the future.

Comment 14 Mike Fiedler 2021-03-04 15:30:06 UTC
Re-opening this temporarily

@jcantril - Looks like the JIRA ticket linked in comment 13 has been deleted.   Can you add whatever tickets are tracking this issue to this bz?  Thanks.

Comment 15 Jeff Cantrill 2021-04-27 13:19:24 UTC
That ticket still exists.  It should be visible to Red Hat internal users.