Bug 1561196

Summary: Update the logging role to use facts from current deployment in lieu of role defaults for ES memory limits
Product: OpenShift Container Platform
Reporter: Peter Portante <pportant>
Component: Logging
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: high
Version: 3.9.0
CC: anli, aos-bugs, ewolinet, jcantril, juzhao, pportant, rmeggins, sreber, tkatarki
Target Milestone: ---
Target Release: 3.9.z
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: In the absence of inventory values, reuse the values from the current deployment to preserve sane or tuned settings. Reason: In the case of Elasticsearch, when a customer had tuned the cluster but did not propagate those values into inventory variables, upgrading logging would apply the role default values, which could put the cluster in a bad state and lead to loss of log data. Result: For EFK, values are honored in this order: inventory -> existing environment -> role defaults.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-17 06:43:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
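
The order stated in the Doc Text (inventory -> existing environment -> role defaults) can be sketched as below. This is only an illustration in Python, not the role's actual Ansible implementation; the function name, the dictionary layout, the key name, and the sample sizes are all assumptions made for the example.

```python
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    inventory: Mapping[str, str],
    existing: Mapping[str, str],
    role_defaults: Mapping[str, str],
) -> Optional[str]:
    """Return the first non-empty value for `name`, checking the sources in
    precedence order: inventory, existing deployment, role defaults."""
    for source in (inventory, existing, role_defaults):
        value = source.get(name)
        if value:  # skip missing keys and empty strings
            return value
    return None


# Hypothetical sample data: the key name and the sizes are illustrative only.
role_defaults = {"es_memory_limit": "8Gi"}
existing = {"es_memory_limit": "32Gi"}  # tuned by hand on the running cluster
inventory = {}                          # value never propagated to the inventory

resolved = resolve_setting("es_memory_limit", inventory, existing, role_defaults)
print(resolved)
```

With these sample inputs the hand-tuned value from the existing deployment is kept, because the inventory does not set the key and the role default is consulted only as a last resort.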

Description Peter Portante 2018-03-27 20:54:14 UTC
The Elasticsearch v2.x sizing guidelines [1] state that machines with less than 8 GB of memory tend to be counterproductive (you end up needing many small instances), that 64 GB is the sweet spot, and that 32 GB and 16 GB are also common sizes.

Let's update the default ES pod size to 16 GB (8 GB for the Java heap and 8 GB reserved for the buffer cache) to stay in line with what is considered common.

[1] https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_memory

Comment 2 Peter Portante 2018-03-28 23:58:03 UTC
(In reply to Rich Megginson from comment #1)
> https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging/defaults/main.yml#L102
> https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging/defaults/main.yml#L139

Yes, thanks!

Comment 15 Junqi Zhao 2018-05-07 06:30:07 UTC
First, deploy logging and change the fluentd nodeSelector to a non-default value, logging-infra-test-fluentd=true:

# oc get ds
NAME              DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
logging-fluentd   2         2         2         2            2           logging-infra-test-fluentd=true   10m

Then update logging with the same inventory. The fluentd nodeSelector reverts to the default, logging-infra-fluentd=true, instead of keeping the existing nodeSelector from the environment:

# oc get ds
NAME              DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                AGE
logging-fluentd   2         2         2         2            2           logging-infra-fluentd=true   15m


# rpm -qa | grep openshift-ansible
openshift-ansible-roles-3.9.28-1.git.0.4fc2ce4.el7.noarch
openshift-ansible-docs-3.9.28-1.git.0.4fc2ce4.el7.noarch
openshift-ansible-playbooks-3.9.28-1.git.0.4fc2ce4.el7.noarch
openshift-ansible-3.9.28-1.git.0.4fc2ce4.el7.noarch

Comment 17 Jeff Cantrill 2018-05-07 17:30:10 UTC
The reported BZ is specific to memory and CPU settings. I am of the opinion that the fluentd nodeSelector issue should not block this test. We should consider opening a separate BZ to resolve fluentd-related issues.

Comment 18 Junqi Zhao 2018-05-08 08:18:33 UTC
Tested: ES memory limits are now taken from the existing deployment instead of the role defaults.

Polarion test case OCP-18917
# rpm -qa | grep openshift-ansible
openshift-ansible-roles-3.9.27-1.git.0.52e35b5.el7.noarch
openshift-ansible-docs-3.9.27-1.git.0.52e35b5.el7.noarch
openshift-ansible-playbooks-3.9.27-1.git.0.52e35b5.el7.noarch
openshift-ansible-3.9.27-1.git.0.52e35b5.el7.noarch

Comment 19 Junqi Zhao 2018-05-08 08:49:36 UTC
The issue in comment 15 is reported in bug 1575901.

Comment 22 errata-xmlrpc 2018-05-17 06:43:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1566