Bug 1879407 - The restart-cluster playbook doesn't take into account that openshift_logging_es_ops_cluster_size could be different from openshift_logging_es_cluster_size
Summary: The restart-cluster playbook doesn't take into account that openshift_logging...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard: logging-exploration
Depends On:
Blocks:
 
Reported: 2020-09-16 08:49 UTC by Andy Bartlett
Modified: 2024-06-13 23:05 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The restart task did not use the cluster size defined for ops clusters when evaluating them.
Consequence: The restart would never complete.
Fix: Pass the logging ops cluster size to the restart task.
Result: Restarts of ops clusters complete as expected.
Clone Of:
Environment:
Last Closed: 2021-03-03 12:27:45 UTC
Target Upstream Version:
Embargoed:


Attachments
The inventory and playbook logs (1.08 MB, application/gzip)
2021-02-20 12:59 UTC, Anping Li


Links
System ID Private Priority Status Summary Last Updated
Github openshift openshift-ansible pull 12284 0 None closed Bug 1879407: Correctly use ops cluster size if evaluating es-ops component for restart 2021-02-17 09:04:58 UTC
Github openshift openshift-ansible pull 12309 0 None closed Bug 1879407: Adding int typecasting for _component_cluster_size 2021-02-24 05:04:36 UTC
Red Hat Knowledge Base (Solution) 5410671 0 None None None 2020-09-17 08:21:53 UTC
Red Hat Product Errata RHSA-2021:0637 0 None None None 2021-03-03 12:29:08 UTC

Description Andy Bartlett 2020-09-16 08:49:42 UTC
Description of problem:

We are trying to upgrade an OpenShift 3.11 cluster to version 219 (from 154). When upgrading the EFK stack, the playbook fails when trying to restart the logging-es-ops cluster. This is because the openshift_logging_es_ops_cluster_size variable is not used when checking the running pods of that cluster.

We are running this playbook:
/usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

In our inventory, openshift_logging_es_ops_cluster_size is set to 3 and openshift_logging_es_cluster_size is set to 5.

As you can see in the code, the task uses the openshift_logging_es_cluster_size var in the until clause: 
~~~
## get all pods for the cluster
- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod
    -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == openshift_logging_es_cluster_size
~~~
https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.11.286-1/roles/openshift_logging_elasticsearch/tasks/restart_cluster.yml#L15

As there are 3 logging-es-ops nodes instead of 5, this check fails.

Version-Release number of selected component (if applicable):
OCP 3.11

How reproducible:
100%

Steps to Reproduce:
1. The customer has hit this issue every time, on different clusters

Actual results:

The upgrade of the logging stack fails.

Expected results:
Ideally, the playbook would check whether openshift_logging_es_ops_cluster_size is set for the logging-ops stack and use that variable instead when restarting the ops cluster.
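
A minimal sketch of what such a check could look like, reusing the task quoted above. This is not the merged change. The name _component_cluster_size is taken from the titles of the linked pull requests, but the set_fact task, the way the value is chosen, and the assumption that the ops component is labelled 'es-ops' are illustrative only.

~~~yaml
## Hypothetical sketch: choose the expected size for the component being
## restarted. The 'es-ops' label value is an assumption.
- set_fact:
    _component_cluster_size: "{{ openshift_logging_es_ops_cluster_size if _cluster_component == 'es-ops' else openshift_logging_es_cluster_size }}"

## get all pods for the cluster (same command as in the original task)
- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod
    -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  # Cast to int: a templated fact or an inventory value may be a string,
  # and a string never compares equal to the integer pod count.
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == _component_cluster_size | int
~~~

With the values from this report, the until clause would compare the running pod count against 3 when restarting the ops cluster and against 5 otherwise.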

Additional info:

Comment 3 Jeff Cantrill 2020-10-23 15:19:55 UTC
Setting UpcomingSprint as unable to resolve before EOD

Comment 8 Anping Li 2021-02-01 15:04:18 UTC
The fix is not in the package openshift-ansible-3.11.380-1.git.0.983c5d1.el7.noarch

Comment 10 ewolinet 2021-02-03 21:18:37 UTC
It looks like this fix didn't make it into 3.11.380-1, but it should make it into 3.11.381-1 when that is released.

Comment 12 Anping Li 2021-02-20 12:59:30 UTC
Created attachment 1758430
The inventory and playbook logs

openshift-ansible-3.11.391-1.git.0.aa2204f.el7.noarch

Comment 15 Anping Li 2021-02-24 05:50:41 UTC
Verified on openshift-ansible-roles-3.11.394-6.git.0.47ec25d.el7.noarch

Comment 17 errata-xmlrpc 2021-03-03 12:27:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 3.11.394 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0637

