Bug 1703546 - Changing clusterlogging CR for ES does not trigger a new ES deployment in a timely fashion
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Josef Karasek
QA Contact: Anping Li
URL:
Whiteboard: aos-scalability-41,logging-core
Depends On:
Blocks:
 
Reported: 2019-04-26 16:35 UTC by Mike Fiedler
Modified: 2020-08-27 16:58 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:05 UTC
Target Upstream Version:


Attachments
logging operator logs and CR yaml (113 bytes, application/gzip)
2019-04-26 16:37 UTC, Mike Fiedler


Links
System ID Private Priority Status Summary Last Updated
Github openshift elasticsearch-operator pull 129 0 None closed Bug 1703546: Waiting for synced flush 2020-09-03 00:49:23 UTC
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:48:13 UTC

Description Mike Fiedler 2019-04-26 16:35:57 UTC
Description of problem:

Looks like bug 1692796 is back. Same scenario: updating the ES resource requests/limits in the clusterlogging CR does not trigger a redeployment of ES.

As with the previous bug, the clusterlogging operator notices the change and updates the elasticsearch CR, but the elasticsearch operator does not seem to react to it. Operator logs attached.
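
For reference, a quick way to check whether the change actually propagated into the elasticsearch CR (the CR name "elasticsearch" and the openshift-logging namespace below are assumed defaults, not taken from this cluster; the actual yaml is attached):

# Sketch only: confirm the clusterlogging operator copied the new resources into
# the elasticsearch CR (CR name and namespace are assumed defaults).
oc -n openshift-logging get elasticsearch elasticsearch -o yaml | grep -A 8 'resources:'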

Version-Release number of selected component (if applicable):

OCP:  Changing ES resource limits in clusterlogging CR does not trigger a new ES deployment
clusterlogging-operator:  quay.io/openshift/origin-cluster-logging-operator@sha256:b1382e25f876aa6edc3748f9a22188082333935904dcd3c2a080870025bf76a3
elasticsearch-operator:  quay.io/openshift/origin-elasticsearch-operator@sha256:3da7e1adfe29e17a27e45e1c74b8881e8958e21b1a6c01ef352aeea340435fd3

Note:  for some reason upstream images for logging operators are being used on an OCP install


How reproducible: 1/1 so far


Steps to Reproduce:
1. Deploy Elasticsearch and Cluster Logging operators from OperatorHub
2. Create a cluster logging CR (default values)
3. Verify all is running well
4. Update the clusterlogging CR to change the ES resource limits (yaml attached; an illustrative patch is sketched just after this list)
5. Verify ES CR is updated (yaml attached)
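
A minimal sketch of step 4 (the CR name "instance", the openshift-logging namespace, and the 4Gi figure are assumptions; the yaml actually used is in the attachment):

# Sketch only: bump the ES resource requests/limits in the clusterlogging CR.
oc -n openshift-logging patch clusterlogging instance --type merge -p '
{
  "spec": {
    "logStore": {
      "elasticsearch": {
        "resources": {
          "limits":   { "memory": "4Gi" },
          "requests": { "cpu": "500m", "memory": "4Gi" }
        }
      }
    }
  }
}'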

Actual results:

Elasticsearch is not redeployed with the new limits


Expected results:

Elasticsearch is redeployed

Additional info:

Comment 1 Mike Fiedler 2019-04-26 16:37:35 UTC
Created attachment 1559255 [details]
logging operator logs and CR yaml

Comment 2 ewolinet 2019-04-29 22:58:27 UTC
Mike,

Are there any messages in the EO logs that refer to being unable to upgrade or waiting for a particular state?
I was unable to recreate this with an image built from master (earlier today on my local system).

Comment 3 Mike Fiedler 2019-04-30 00:10:32 UTC
I did not see the EO log move at all when I saved the updated clusterlogging CR. Let me install from today's puddle and give it another try. In the meantime, the EO logs are in the attachment on this bz.

Comment 4 Mike Fiedler 2019-04-30 01:48:13 UTC
I re-ran this with the latest images served by OperatorHub, and I do see one new message appear in the EO log.

The diff of the ES operator pod logs before/after the scenario is:

> time="2019-04-30T01:43:34Z" level=warning msg="Unable to perform synchronized flush: <nil>"

The diff of the cluster logging op logs before/after the scenario is:

> time="2019-04-30T01:42:57Z" level=info msg="Elasticsearch resources change found, updating elasticsearch"
> time="2019-04-30T01:43:16Z" level=info msg="Elasticsearch resources change found, updating elasticsearch"

Let me know if there is anything else that might help.
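
For reference, the before/after diffs above can be captured along these lines (the deployment names and namespaces are assumptions for a default install):

# Sketch only: snapshot the operator logs before and after editing the CR, then diff.
oc -n openshift-operators-redhat logs deployment/elasticsearch-operator > eo-before.log
oc -n openshift-logging logs deployment/cluster-logging-operator > clo-before.log
# ... update the clusterlogging CR and wait a few minutes ...
oc -n openshift-operators-redhat logs deployment/elasticsearch-operator > eo-after.log
oc -n openshift-logging logs deployment/cluster-logging-operator > clo-after.log
diff eo-before.log eo-after.log
diff clo-before.log clo-after.log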

Comment 5 Jeff Cantrill 2019-04-30 12:41:01 UTC
Modified the summary to reflect the cause. The issue is that it takes a long time for the operator to act on the change because it first forces Elasticsearch to do a synced flush. This operation pushes in-memory data to disk, which can take time depending on cluster size, amount of data, ingestion rate, etc.
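
For context, the synced flush in question is Elasticsearch's standard _flush/synced API. A minimal sketch of requesting it by hand from inside an ES pod (the pod label and client-cert paths below are assumptions for a default logging deployment):

# Sketch only: run the same synced flush the operator waits on, from inside an ES pod.
ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-logging exec "$ES_POD" -c elasticsearch -- \
  curl -s -XPOST \
    --cacert /etc/elasticsearch/secret/admin-ca \
    --cert /etc/elasticsearch/secret/admin-cert \
    --key /etc/elasticsearch/secret/admin-key \
    https://localhost:9200/_flush/synced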

Comment 8 Mike Fiedler 2019-05-03 01:22:57 UTC
Verified on:

quay.io/openshift/origin-elasticsearch-operator@sha256:59d5e2e988573428c0474c96c25d0fc48e0f80f64b657e5e2618b4372239a605
version   4.1.0-0.nightly-2019-05-02-131943   True        False         6h51m   Cluster version is 4.1.0-0.nightly-2019-05-02-131943

elasticsearch redeployed with reasonable messages in the operator log

Comment 10 errata-xmlrpc 2019-06-04 10:48:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

