Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1629069

Summary: All nodes in elasticsearch cluster not busy simultaneously - busy in sequence when logging from 2000 nodes
Product: OpenShift Container Platform
Component: Logging
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WORKSFORME
Severity: medium
Priority: unspecified
Reporter: Mike Fiedler <mifiedle>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Anping Li <anli>
CC: aos-bugs, rmeggins
Whiteboard: aos-scalability-311
Last Closed: 2019-04-22 19:44:56 UTC
Type: Bug
Attachments: logging-es logs

Description Mike Fiedler 2018-09-14 19:19:08 UTC
Created attachment 1483387 [details]
logging-es logs

Description of problem:

I need to investigate this more, but I have to document what I am seeing. Same environment as https://bugzilla.redhat.com/show_bug.cgi?id=1629015 - 3.11 scalability testing with 2000 nodes and a 3-node ES cluster on m4.10xlarge-equivalent nodes. ES pods have a 61GB memory limit.

I started a run of 2000 pods across 1000 namespaces, each pod logging at a rate of 100 messages/second. Watching the bulk data stats, it was very apparent that the run split into thirds, with only 1 ES server active at a time. Additionally, it took a very long time after the last message was logged for all messages to be indexed - the ES servers were still active 30 minutes after the last message was sent.

The columns are bulk requests active, queued, and rejected. These are representative samples, not complete data.

For the first third of the test the bulk stats looked like this:

Fri Sep 14 14:22:02 EDT 2018
logging-es-data-master-mf4gygmf bulk                 0   0   898
logging-es-data-master-40vugi05 bulk                32 200 55943
logging-es-data-master-iynpt3ho bulk                 0   0   184
Fri Sep 14 14:22:32 EDT 2018
logging-es-data-master-mf4gygmf bulk                 0   0   898
logging-es-data-master-40vugi05 bulk                32 200 59499
logging-es-data-master-iynpt3ho bulk                 0   0   184
Fri Sep 14 14:23:03 EDT 2018
logging-es-data-master-mf4gygmf bulk                 1   0   898
logging-es-data-master-40vugi05 bulk                32 199 62720
logging-es-data-master-iynpt3ho bulk                 0   0   184


For the second part:

Fri Sep 14 14:40:08 EDT 2018
logging-es-data-master-mf4gygmf bulk                 5   0   1798
logging-es-data-master-40vugi05 bulk                 5   0 106046
logging-es-data-master-iynpt3ho bulk                32 199   2985
Fri Sep 14 14:40:39 EDT 2018
logging-es-data-master-mf4gygmf bulk                 0   0   1798
logging-es-data-master-40vugi05 bulk                 5   0 106046
logging-es-data-master-iynpt3ho bulk                32 200   3419
Fri Sep 14 14:41:10 EDT 2018
logging-es-data-master-mf4gygmf bulk                 0   0   1798
logging-es-data-master-40vugi05 bulk                 2   0 106046
logging-es-data-master-iynpt3ho bulk                32 200   3894

For the third part:

Fri Sep 14 14:58:09 EDT 2018
logging-es-data-master-mf4gygmf bulk                32 198   6241
logging-es-data-master-40vugi05 bulk                 7   0 106046
logging-es-data-master-iynpt3ho bulk                 1   0  10579
Fri Sep 14 14:58:40 EDT 2018
logging-es-data-master-mf4gygmf bulk                32 198   6632
logging-es-data-master-40vugi05 bulk                 3   0 106046
logging-es-data-master-iynpt3ho bulk                 2   0  10579
Fri Sep 14 14:59:11 EDT 2018
logging-es-data-master-mf4gygmf bulk                32 198   7014
logging-es-data-master-40vugi05 bulk                 7   0 106046
logging-es-data-master-iynpt3ho bulk                 2   0  10579
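
The samples above look like Elasticsearch `_cat/thread_pool`-style output (node, pool, active, queue, rejected). As a minimal sketch of how one might quantify the imbalance from such samples - the parsing format is an assumption, and the values are copied from the first block above:

```python
def parse(lines):
    """Return {node: (active, queue, rejected)} from cat-API style rows."""
    stats = {}
    for line in lines:
        node, pool, active, queue, rejected = line.split()
        stats[node] = (int(active), int(queue), int(rejected))
    return stats

# Values from the 14:22:02 sample in this report.
sample = [
    "logging-es-data-master-mf4gygmf bulk  0   0   898",
    "logging-es-data-master-40vugi05 bulk 32 200 55943",
    "logging-es-data-master-iynpt3ho bulk  0   0   184",
]

stats = parse(sample)
total_rejected = sum(r for _, _, r in stats.values())
# The "busy" node is the one with work in flight (active + queued).
busiest = max(stats, key=lambda n: stats[n][0] + stats[n][1])
share = stats[busiest][2] / total_rejected
print(f"busiest node: {busiest}, holds {share:.0%} of all bulk rejections")
```

With the first-third sample, a single node holds nearly all in-flight work and rejections, matching the "one server active at a time" observation.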

There are no ES pod restarts and no readiness failures.


In most previous runs, before the cluster got into this state, all servers showed as active at the same time.

Version-Release number of selected component (if applicable): 3.11.0-0.28.0


How reproducible: Unknown - the ES cluster is stuck in this state right now. After documenting this I'll try bouncing ES and all of the fluentd pods.



Additional info:

Attaching logging-es logs but I didn't see anything obvious.  

I have pbench data but have not processed it yet.  I will add links in a private comment.

Comment 1 Rich Megginson 2018-09-14 19:50:27 UTC
> The columns are bulk requests active, queued and rejected.  These are representative samples, not complete data.
> 
> For the first third of the test the bulk stats looked like this:
> 
> Fri Sep 14 14:22:02 EDT 2018
> logging-es-data-master-mf4gygmf bulk                 0   0   898
> logging-es-data-master-40vugi05 bulk                32 200 55943

This means the server had 55943 rejected bulk indexing operations? You might want to also capture the number of bulk operations completed (bc) so that we can compare the number of ops successfully completed vs. the number which were rejected - e.g. for logging-es-data-master-mf4gygmf, I'd like to know how many successful operations there were to compare against the 898 bulk rejections.

But yes, it is troubling that the load is not being evenly spread out among the cluster. The connection-reloading functionality in fluentd should help with this _as long as the haproxy in the logging-es Service is working correctly_. Can we get haproxy statistics from the Service to make sure it is load balancing the connections among all 3 ES nodes?

The other problem may be sharding - the optimal setting for sharding here would be 3 shards per index e.g. https://github.com/richm/docs/blob/master/increase-number-of-shards-for-index-template.md
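
Per the linked writeup, the change amounts to raising `number_of_shards` in the index template that Elasticsearch applies to new indices, so that each data node can host a primary shard of every index. A sketch of such a template override - the `project.*` pattern, `order` value, and `number_of_replicas: 0` here are illustrative, not taken from this bug:

```json
{
  "template": "project.*",
  "order": 10,
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 0
    }
  }
}
```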

However, that would only help if there are only 1 or 2 indices which have a large number of documents, and the other indices have almost no documents.  How many namespaces are there in this test?  Are there load pods running in all of those namespaces?  Are they generating roughly the same amount of traffic?

Comment 2 Mike Fiedler 2018-09-14 20:22:03 UTC
I will repeat, starting with fresh ES pods, capturing bulk completed, and with shards set to 3 per index.

> How many namespaces are there in this test? 

1000

>Are there load pods running in all of those namespaces?  Are they generating roughly the same amount of traffic?

Yes and Yes - exactly the same amount of traffic.

Comment 3 Mike Fiedler 2018-09-20 01:06:34 UTC
Changing the shards to 3 made a huge difference. I'm seeing even distribution of bulk completions (and rejections) over the ES cluster. At 50 messages/second/node x 2000 nodes = 100,000 1K messages/second total for the cluster, bulk rejections are running at 15-20% of completions. Going to back the rate off some to try to find a sustainable rate with a lower number of bulk rejections.
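
The arithmetic in that comment, as a quick sketch (the rejection figures treat the 15-20% range as a fraction of total throughput, which is an approximation - the report measured it against completions):

```python
# Figures from comment 3: 2000 nodes, 50 msg/s/node, 1K messages.
nodes = 2000
rate_per_node = 50          # messages/second/node
msg_size_kb = 1

total_rate = nodes * rate_per_node               # messages/second into the cluster
ingest_mb_s = total_rate * msg_size_kb / 1024    # approximate ingest bandwidth

# Observed bulk rejections ran 15-20%:
rejected_low, rejected_high = 0.15 * total_rate, 0.20 * total_rate

print(f"{total_rate} msg/s (~{ingest_mb_s:.0f} MB/s); "
      f"rejections roughly {rejected_low:.0f}-{rejected_high:.0f} msg/s")
```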

Also trying to reproduce https://bugzilla.redhat.com/show_bug.cgi?id=1629015 with fluentd debug on.

Comment 4 Rich Megginson 2018-09-20 02:36:04 UTC
(In reply to Mike Fiedler from comment #3)
> Changing the shards to 3 made a huge difference.  I'm seeing even
> distribution of bulk completions (and rejections) over the ES cluster.   At
> 50 messages/second/node x 2000 nodes = 100,000 1K messages/second total for
> the cluster, bulk rejections are running 15-20% of completions.   

Thanks! Yes, this is the "high performance but unsafe" configuration - if you lose a node, you lose 1/3 of your data, so better back up your /elasticsearch partition somewhere . . .

> Going to
> back the rate off some to try to find a sustainable rate with a lower number
> of bulk rejections.   

Yes, it would be nice to find the cutoff to say when to add more infra/es nodes.

> 
> Also trying to reproduce https://bugzilla.redhat.com/show_bug.cgi?id=1629015
> with fluentd debug on.

ok, thanks

Comment 7 Jeff Cantrill 2019-04-22 19:44:56 UTC
We can resolve this by setting the shard count to the number of data nodes, which will affect every index in the cluster [1]. Alternatively, users can follow the instructions from #c6 to target the operations indices specifically.
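
In inventory terms, that resolution might look like the following - the variable names here are assumed from the 3.11 openshift_logging_elasticsearch role and should be verified against [1] for your release:

```ini
# Ansible inventory sketch (variable names are an assumption; replicas=0
# matches the no-replica trade-off discussed in comment 4).
[OSEv3:vars]
openshift_logging_es_number_of_shards=3
openshift_logging_es_number_of_replicas=0
```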

Closing as WORKSFORME.

[1] https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_logging_elasticsearch/tasks/main.yaml#L473