Bug 1414810

Summary: Unexpected error against existing .operations index [action.bulk]
Product: OpenShift Container Platform
Component: Logging
Version: 3.4.0
Status: CLOSED WONTFIX
Severity: low
Priority: medium
Reporter: Ricardo Lourenco <rlourenc>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Xia Zhao <xiazhao>
CC: aos-bugs, jeder, lvlcek, pportant, rlourenc, rmeggins, tstclair
Whiteboard: aos-scalability-34
Type: Bug
Last Closed: 2017-02-16 15:50:22 UTC

Description Ricardo Lourenco 2017-01-19 14:10:04 UTC
Description of problem:

The .operations index for the given date exists, but the following is found in the ES pod logs:

[2017-01-04 19:51:17,148][WARN ][action.bulk              ] [Jonathan "John" Garrett] unexpected error during the primary phase for action [indices:data/write/bulk[s]], request [BulkShardRequest to [.operations.2017.01.04] containing [88] requests]
[.operations.2017.01.04] IndexNotFoundException[no such index]
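
For reference, a quick way to cross-check from inside the ES pod whether the index really exists at the time the warning fires looks roughly like the sketch below (ES on https://localhost:9200 and the admin cert/key paths are assumptions and may differ in this deployment):

# Sketch: check whether the .operations index exists on the logging ES.
# Assumed to run inside the ES pod; cert/key paths are assumptions.
import requests

ES_URL = "https://localhost:9200"
CERT = ("/etc/elasticsearch/secret/admin-cert",   # assumed path
        "/etc/elasticsearch/secret/admin-key")    # assumed path
INDEX = ".operations.2017.01.04"

# HEAD /<index> returns 200 if the index exists, 404 otherwise.
resp = requests.head("%s/%s" % (ES_URL, INDEX), cert=CERT, verify=False)
print(INDEX, "exists" if resp.status_code == 200 else "missing (HTTP %d)" % resp.status_code)

# _cat/recovery gives the same view as the _cat_recovery.log linked under Additional info.
print(requests.get("%s/_cat/recovery/%s?v" % (ES_URL, INDEX), cert=CERT, verify=False).text)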


Version-Release number of selected component (if applicable):
3.4.0


Steps to Reproduce:
1. Install logging infra
2. Scale enrichment fluentd deploymentconfigs to 20 
3. Label 200 nodes with logging-infra-fluentd=true
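
For completeness, steps 2 and 3 map to oc commands along these lines (a rough sketch only; the deploymentconfig name is an assumption based on the logging-enrichment-fluentd service seen later in the fluentd logs, and the node selection here is just illustrative):

# Rough sketch of steps 2 and 3; dc name and node selection are illustrative.
import subprocess

def oc(*args):
    # Run an oc command and raise if it fails.
    subprocess.run(["oc"] + list(args), check=True)

# Step 2: scale the enrichment fluentd deploymentconfig to 20 replicas
# (dc name assumed from the "logging-enrichment-fluentd" service seen later).
oc("scale", "dc/logging-enrichment-fluentd", "--replicas=20")

# Step 3: label 200 nodes so a fluentd collector is scheduled on each of them.
nodes = subprocess.check_output(["oc", "get", "nodes", "-o", "name"]).decode().split()
for node in nodes[:200]:
    oc("label", node, "logging-infra-fluentd=true", "--overwrite")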


Additional info:

Error log:
http://pbench.perf.lab.eng.bos.redhat.com/results/ocp34-master-1/erd_fluentd_256ll_5wppn_20_200_3ES_1h/log/oc/post_test/pods/oc_logs-logging-es-vjasi5h1-1-aqkiv.log

Confirmation that index exists:
http://pbench.perf.lab.eng.bos.redhat.com/results/ocp34-master-1/erd_fluentd_256ll_5wppn_20_200_3ES_1h/log/es/post_test/logging-es-vjasi5h1-1-aqkiv/_cat_recovery.log


Search for ".operations.2017.01.04"
http://pbench.perf.lab.eng.bos.redhat.com/results/ocp34-master-1/erd_fluentd_256ll_5wppn_20_200_3ES_1h/log/es/post_test/logging-es-vjasi5h1-1-aqkiv/_cluster_state.log

Comment 1 Rich Megginson 2017-01-20 16:17:11 UTC
Is there any missing data?  Can we see the fluentd logs to correlate that message with possibly something in the fluentd log?

Comment 2 Peter Portante 2017-01-20 16:26:19 UTC
That sort of thing happens during a bulk operation when the bulk service fails to create the index automatically.  There is a good chance this is related to a SearchGuard behavior preventing the new index from being created.

What version of Elasticsearch here?  I am going to guess 2.x.
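
If it is 2.x, one thing worth checking is whether automatic index creation is being restricted at the node level; a minimal sketch of that check (reusing the same assumed admin cert/key paths as the snippet in the description) would be:

# Sketch: dump action.auto_create_index from the node settings to see whether
# automatic index creation is restricted. Cert/key paths are assumptions.
import requests

ES_URL = "https://localhost:9200"
CERT = ("/etc/elasticsearch/secret/admin-cert",
        "/etc/elasticsearch/secret/admin-key")

nodes = requests.get(ES_URL + "/_nodes/settings?flat_settings=true",
                     cert=CERT, verify=False).json()
for node_id, info in nodes.get("nodes", {}).items():
    value = info.get("settings", {}).get("action.auto_create_index", "<unset, defaults to true>")
    print(info.get("name", node_id), "action.auto_create_index =", value)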

Also, with regard to fluentd logs, if they are flowing in the same event stream as the bulk operation, that will be a problem.

Comment 3 Jeff Cantrill 2017-01-23 18:42:41 UTC
@Peter, can you clarify what "..are flowing in the same event stream.." means? Is this something we can modify?

Comment 4 Peter Portante 2017-01-24 00:06:00 UTC
(In reply to Jeff Cantrill from comment #3)
> @Peter, can you clarify what "..are flowing in the same event stream.."
> means? Is this something we can modify?

Sorry, I should have said: "If log messages from the fluentd processes themselves are flowing into the same event stream as other logs from the system in which that fluentd process is running, then we might not even see the logs from fluentd, since they would potentially be dropped along with other logs that don't get indexed."

Does that make sense?

Another reason we need a secondary data path for the logs from the logging infrastructure itself.

Comment 5 Ricardo Lourenco 2017-01-24 15:47:18 UTC
@(In reply to Rich Megginson from comment #1)
> Is there any missing data?  Can we see the fluentd logs to correlate that
> message with possibly something in the fluentd log?

I see some of the fluentd pods having to retry the connection to the enrichment-fluentd service, although most of them seem to be able to recover the connection.

disconnected host="logging-enrichment-fluentd" port=24284

http://pbench.perf.lab.eng.bos.redhat.com/results/ocp34-master-1/erd_fluentd_256ll_5wppn_20_200_3ES_1h/log/oc/post_test/pods/oc_logs-logging-fluentd-kutj4.log

http://pbench.perf.lab.eng.bos.redhat.com/results/ocp34-master-1/erd_fluentd_256ll_5wppn_20_200_3ES_1h/log/oc/post_test/pods/oc_logs-logging-fluentd-prwf3.log

http://pbench.perf.lab.eng.bos.redhat.com/results/ocp34-master-1/erd_fluentd_256ll_5wppn_20_200_3ES_1h/log/oc/post_test/pods/oc_logs-logging-fluentd-vm94g.log

Comment 6 Jeff Cantrill 2017-02-02 20:19:20 UTC
Lukas, do you have any input regarding this issue? It doesn't seem we have any data loss. Do you think we may wish to just document and move on? Can we lower the priority, as it does not seem to be a blocker? I am not sure what there is for us to do unless you have ideas.

Comment 7 Jeff Cantrill 2017-02-08 01:05:28 UTC
Ricardo,

Can you comment on the priority/severity of this issue?  It seems we have no data loss and fluentd is able to recover.  We are looking to move this off the blocker list.

Comment 8 Ricardo Lourenco 2017-02-08 10:15:17 UTC
@Peter, yes this is 2.x

@Jeff,

Both the pre-test and post-test logs show the same message, and "during the primary phase for action" might mean this happens during initial sharding operations.

It looks like an index was deleted while a bulk request was still in flight, which is very possible given that the test harness deletes all indices right at the beginning of the test in order to start recording data with "clean" indices.

Given that this is a [WARN] message which doesn't seem to repeat much in the logs, I wouldn't say it is a blocker, and it can be lowered in priority.

Comment 9 Peter Portante 2017-02-08 13:14:44 UTC
Ah, so then this seems to be a bug with the test harness, right?

What is the requirement that indices be deleted before a test? All logs for the test should have a timestamp that properly aligns with the test. If you have a requirement to test how logging is handled from the initial instance, we should probably review that, because most problems in the field will come while the environment is in full use, no?

If we do need to start from scratch, then can't the test harness first turn off all the fluentd pods, then delete the indices, then turn the fluentd pods back on?
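
Something like the following reset sequence in the harness would avoid the race entirely (only a sketch; it assumes the collectors are controlled by the logging-infra-fluentd node label, that the enrichment dc is named logging-enrichment-fluentd, the same admin cert/key paths as earlier, and whatever index patterns the test actually writes to):

# Sketch of a "quiesce, wipe, resume" reset for the test harness.
# All names, paths, and index patterns are assumptions; adjust to the deployment.
import subprocess
import requests

ES_URL = "https://localhost:9200"
CERT = ("/etc/elasticsearch/secret/admin-cert",
        "/etc/elasticsearch/secret/admin-key")

def oc(*args):
    subprocess.run(["oc"] + list(args), check=True)

# 1. Stop the writers: remove the node label so the fluentd collectors go away,
#    and scale the enrichment fluentd dc to zero. (A real harness should also
#    wait here for the pods to terminate.)
oc("label", "nodes", "--all", "logging-infra-fluentd-")
oc("scale", "dc/logging-enrichment-fluentd", "--replicas=0")

# 2. Delete the test indices while nothing is writing to them.
for pattern in (".operations.*",):  # plus whatever project index patterns the test uses
    requests.delete("%s/%s" % (ES_URL, pattern), cert=CERT, verify=False)

# 3. Turn the writers back on.
oc("scale", "dc/logging-enrichment-fluentd", "--replicas=20")
oc("label", "nodes", "--all", "logging-infra-fluentd=true", "--overwrite")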

I agree based on this information that this is not a priority or a bug with OpenShift.

Comment 10 Ricardo Lourenco 2017-02-08 13:52:39 UTC
Most likely, yes.
This was added as suggested by Lukas. At the beginning of each test we delete all data in the indices from the previous runs, also to clear, or have ES optimize away, any extra disk space used by those indices.

Comment 11 Peter Portante 2017-02-08 14:13:52 UTC
Does not seem realistic.

Perhaps we should curate indices at 1 week, run tests for one week to fill the environment, measuring as we go, and then take measurements when at capacity?

Comment 12 Jeff Cantrill 2017-02-08 14:34:15 UTC
Lowering the priority as not a blocker and there is no data loss

Comment 13 Jeff Cantrill 2017-02-08 14:35:12 UTC
Lukas, looking for your input here.

Comment 14 Lukas Vlcek 2017-02-08 14:47:56 UTC
It seems I did not realise that you use a shared ES cluster instance for several parallel/consecutive tests. Can you please explain to me how and when you re-use the ES cluster for several tests?

My recommendation was to always start with a clean, fresh ES cluster for each test, to minimize the risk that we interpret results incorrectly and to make sure things like cache evictions, heavy GC cycles, etc. are not impacted by use of the cluster in previous tests...

Comment 15 Peter Portante 2017-02-08 15:04:09 UTC
(In reply to Lukas Vlcek from comment #14)
> It seems I did not realise that you use a shared ES cluster instance for
> several parallel/consecutive tests. Can you please explain to me how and
> when you re-use the ES cluster for several tests?

Not sure this is about parallel testing or consecutive testing.  If fluentd pods are always running, and one deletes all the indices that they are writing to at the time, then there is always a chance a delete will take place during a bulk index operation, causing this problem.

Deleting indices while fluentd pods are running seems like a really bad idea.

> 
> My recommendation was to always start with clean fresh ES cluster for each
> test to minimize the risk we interpret results incorrectly and make sure
> things like cache evictions, heavy GC cycles ... etc are not impacted by use
> of the cluster in previous tests...

Certainly that is one mode of testing, but I am not sure how useful that is.  How often will a customer have a pristine environment?  If we measure logging in that environment, how can we compare that to what customers encounter in the live scenario?

It seems we need to rethink this testing strategy.

Comment 17 Lukas Vlcek 2017-02-08 15:22:28 UTC
> Deleting indices while fluentd pods are running seems like a really bad idea.

I agree. I did not think about this context at the time; I thought we could control it very precisely.

> How often will a customer have a pristine environment?  If we measure logging in that environment, how can we compare that to what customers encounter in the live scenario?

I understand. But we should be able to test with a clean, fresh ES as well. For example, we want to be able to verify that under specific conditions we can guarantee 100% log delivery and indexing. This means we want "total.doc.count = sum of all logs generated on all pods". For this type of test we might need to improve our testing environment?
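
A check like that could look roughly like this (only a sketch; the cert/key paths, the index pattern, and the way the expected total is obtained from the harness are all assumptions):

# Sketch: compare what ES indexed against what the log generators claim they wrote.
# Cert/key paths, index pattern, and the source of expected_total are assumptions.
import requests

ES_URL = "https://localhost:9200"
CERT = ("/etc/elasticsearch/secret/admin-cert",
        "/etc/elasticsearch/secret/admin-key")

def indexed_docs(pattern):
    # GET /<pattern>/_count sums the doc count across all matching indices.
    resp = requests.get("%s/%s/_count" % (ES_URL, pattern), cert=CERT, verify=False)
    return resp.json()["count"]

expected_total = 0  # sum of logs generated on all pods, as reported by the harness
actual_total = indexed_docs(".operations.*")  # plus project indices as needed

print("expected=%d indexed=%d missing=%d"
      % (expected_total, actual_total, expected_total - actual_total))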

Comment 18 Peter Portante 2017-02-08 15:59:30 UTC
(In reply to Lukas Vlcek from comment #17)
> 
> I understand. But we should be able to test with a clean, fresh ES as well.
> For example, we want to be able to verify that under specific conditions we
> can guarantee 100% log delivery and indexing. This means we want
> "total.doc.count = sum of all logs generated on all pods". For this type of
> test we might need to improve our testing environment?

We just need to turn off fluentd pods before we delete all indices, no?

Comment 21 Ricardo Lourenco 2017-02-08 16:30:15 UTC
(In reply to Lukas Vlcek from comment #14)
> It seems I did not realise that you use a shared ES cluster instance for
> several parallel/consecutive tests. Can you please explain to me how and
> when you re-use the ES cluster for several tests?
> 
> My recommendation was to always start with a clean, fresh ES cluster for
> each test, to minimize the risk that we interpret results incorrectly and
> to make sure things like cache evictions, heavy GC cycles, etc. are not
> impacted by use of the cluster in previous tests...

I didn't run any parallel tests. Before each test starts, the indices are deleted so we don't mix data from, say, a 10-pods-per-node run with a 30-pods-per-node run.
The test does not scale the fluentds down to 0 when doing that, so a delete operation might be taking place while the fluentds are already trying to send data.

Comment 28 Rich Megginson 2017-02-13 18:51:20 UTC
If the problem here is deleting indices that are still in use, I would like to close this bug as WONTFIX.  Does everyone agree?

Comment 29 Peter Portante 2017-02-14 02:45:04 UTC
I don't believe this is a bug.  If "won't fix" implies that it is a bug, is there another category to use, like invalid or something?

Comment 30 Jeff Cantrill 2017-02-16 15:50:22 UTC
Closing since there seems to be agreement that this is an unlikely and invalid scenario for a customer.