Bug 1879407

Summary: The restart-cluster playbook doesn't take into account that openshift_logging_es_ops_cluster_size could be different from openshift_logging_es_cluster_size
Product: OpenShift Container Platform
Reporter: Andy Bartlett <andbartl>
Component: Logging
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, ewolinet, periklis
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-exploration
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The restart task did not use the defined cluster size for ops clusters.
Consequence: The restart would never complete.
Fix: Pass the logging ops cluster size to the restart task.
Result: Restarts of ops clusters complete as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-03 12:27:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: The inventory and playbook logs (flags: none)

Description Andy Bartlett 2020-09-16 08:49:42 UTC
Description of problem:

We are trying to upgrade an OpenShift 3.11 cluster from 3.11.154 to 3.11.219. When upgrading the EFK stack, the playbook fails when trying to restart the logging-es-ops cluster. This is because the openshift_logging_es_ops_cluster_size variable is not used when getting the running pods in that cluster.

We are running this playbook:
/usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

With openshift_logging_es_ops_cluster_size set to 3 and openshift_logging_es_cluster_size set to 5.

As you can see in the code, the task uses the openshift_logging_es_cluster_size var in the until clause: 
~~~
## get all pods for the cluster
- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod
    -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == openshift_logging_es_cluster_size
~~~
https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.11.286-1/roles/openshift_logging_elasticsearch/tasks/restart_cluster.yml#L15

As there are only 3 logging-es-ops nodes instead of 5, this check never succeeds: the until condition compares the 3 running ops pods against openshift_logging_es_cluster_size (5), so the task exhausts its retries and the play fails.
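
For illustration, here is a minimal, self-contained sketch of how that until expression evaluates for the ops cluster. The playbook, the _cluster_pods_stdout variable, and the pod names are made up for the example; only the comparison itself is taken from the task above:
~~~
# Hypothetical stand-alone playbook that evaluates just the comparison.
# With 3 running ops pods but openshift_logging_es_cluster_size set to 5,
# the expression is always false, so the real task keeps retrying until
# __elasticsearch_ready_retries is exhausted and the play fails.
- hosts: localhost
  gather_facts: false
  vars:
    openshift_logging_es_cluster_size: 5
    # jsonpath-style output for the 3 running logging-es-ops pods (made-up names)
    _cluster_pods_stdout: "logging-es-ops-data-master-1 logging-es-ops-data-master-2 logging-es-ops-data-master-3"
  tasks:
  - debug:
      msg: "{{ _cluster_pods_stdout == '' or _cluster_pods_stdout.split(' ') | count == openshift_logging_es_cluster_size }}"
      # prints "False": 3 running pods never equal the configured size of 5
~~~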

Version-Release number of selected component (if applicable):
OCP 3.11

How reproducible:
100%

Steps to Reproduce:
1. Run the openshift-logging config playbook with openshift_logging_es_ops_cluster_size set to a different value than openshift_logging_es_cluster_size; the customer has hit this issue every time, on different clusters.

Actual results:

Upgrade of the logging stack fails.

Expected results:
Ideally the playbook would check whether openshift_logging_es_ops_cluster_size is set for the ops stack and use that variable instead when restarting the ops cluster.
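
For reference, a minimal sketch of the kind of change being requested follows. This is not the actual upstream patch; the _expected_cluster_size fact and the _cluster_component == 'es-ops' test are assumptions made for illustration:
~~~
# Sketch only -- not the upstream fix. Select the expected size based on the
# component being restarted, then compare the running pod count against it.
- set_fact:
    _expected_cluster_size: "{{ openshift_logging_es_ops_cluster_size if _cluster_component == 'es-ops' else openshift_logging_es_cluster_size }}"

- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod
    -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  # cast to int because set_fact templating may turn the size into a string
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == _expected_cluster_size | int
~~~
The fix that shipped (per the Doc Text above) passes the logging ops cluster size into the restart task; the sketch only illustrates where the ops size needs to be consulted.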

Additional info:

Comment 3 Jeff Cantrill 2020-10-23 15:19:55 UTC
Setting UpcomingSprint as unable to resolve before EOD

Comment 8 Anping Li 2021-02-01 15:04:18 UTC
The fix is not in the package openshift-ansible-3.11.380-1.git.0.983c5d1.el7.noarch

Comment 10 ewolinet 2021-02-03 21:18:37 UTC
It looks like this fix didn't make it into 3.11.380-1, but it should make it into 3.11.381-1 when that is released.

Comment 12 Anping Li 2021-02-20 12:59:30 UTC
Created attachment 1758430 [details]
The inventory and playbook logs

openshift-ansible-3.11.391-1.git.0.aa2204f.el7.noarch

Comment 15 Anping Li 2021-02-24 05:50:41 UTC
Verified on openshift-ansible-roles-3.11.394-6.git.0.47ec25d.el7.noarch

Comment 17 errata-xmlrpc 2021-03-03 12:27:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 3.11.394 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0637