1879407 – The restart-cluster playbook doesn't take into account that openshift_logging_es_ops_cluster_size could be different from openshift_logging_es_cluster_size

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.11.z
Assignee:	ewolinet
QA Contact:	Anping Li
Docs Contact:
URL:
Whiteboard:	logging-exploration
Depends On:
Blocks:

Reported:	2020-09-16 08:49 UTC by Andy Bartlett
Modified:	2024-06-13 23:05 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The restart task did not use the cluster size defined for ops clusters when evaluating readiness. Consequence: The restart would never complete. Fix: Pass the logging ops cluster size. Result: Restarts of ops clusters complete as expected.
Clone Of:
Environment:
Last Closed:	2021-03-03 12:27:45 UTC
Target Upstream Version:
Embargoed:

Attachments
The inventory and playbook logs (1.08 MB, application/gzip) 2021-02-20 12:59 UTC, Anping Li	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift openshift-ansible pull 12284	None	closed	Bug 1879407: Correctly use ops cluster size if evaluating es-ops component for restart	2021-02-17 09:04:58 UTC
Github	openshift openshift-ansible pull 12309	None	closed	Bug 1879407: Adding int typecasting for _component_cluster_size	2021-02-24 05:04:36 UTC
Red Hat Knowledge Base (Solution)	5410671	None	None	None	2020-09-17 08:21:53 UTC
Red Hat Product Errata	RHSA-2021:0637	None	None	None	2021-03-03 12:29:08 UTC

Description Andy Bartlett 2020-09-16 08:49:42 UTC

Description of problem:

We are trying to upgrade an OpenShift 3.11 cluster to version 219 (from 154). When upgrading the EFK stack, the playbook fails while trying to restart the logging-es-ops cluster. This happens because the openshift_logging_es_ops_cluster_size variable is not used when counting the running pods in the cluster.

We are running this playbook:
/usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

with openshift_logging_es_ops_cluster_size set to 3 and openshift_logging_es_cluster_size set to 5.

As the code below shows, the task compares the pod count against the openshift_logging_es_cluster_size variable in its until clause, regardless of which component is being restarted:
~~~
## get all pods for the cluster
- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod
    -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == openshift_logging_es_cluster_size
~~~
https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.11.286-1/roles/openshift_logging_elasticsearch/tasks/restart_cluster.yml#L15

As there are 3 logging-es-ops nodes instead of 5, the pod count never equals openshift_logging_es_cluster_size, so this check fails once the retries are exhausted.

Version-Release number of selected component (if applicable):
OCP 3.11

How reproducible:
100%

Steps to Reproduce:
1. The customer has hit this issue every time, on different clusters.

Actual results:

The upgrade of the logging stack fails.

Expected results:
Ideally, the playbook would check whether openshift_logging_es_ops_cluster_size is set for the logging-ops stack and use that variable instead when restarting the ops cluster.
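
For illustration, the expected behaviour could look roughly like the sketch below. This is not the merged change: the variable name _component_cluster_size is borrowed from the title of the linked pull request 12309, and the assumption that _cluster_component equals "es-ops" for the ops cluster is ours. The int cast is there in case the size arrives as a string, for example from an inventory file.
~~~
## Illustrative sketch only, not the merged fix.
## Assumption: _cluster_component is "es-ops" when the ops cluster is restarted.
## Falls back to the non-ops size when the ops size is not defined.
- set_fact:
    _component_cluster_size: "{{ (openshift_logging_es_ops_cluster_size | default(openshift_logging_es_cluster_size)) if _cluster_component == 'es-ops' else openshift_logging_es_cluster_size }}"

## get all pods for the cluster
- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod
    -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  ## compare against the size chosen for this component, cast to an integer
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == _component_cluster_size | int
~~~
With the sizes from this report (3 for ops, 5 for non-ops), the until clause would then wait for 3 running pods when restarting logging-es-ops and for 5 when restarting logging-es.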

Additional info:

Comment 3 Jeff Cantrill 2020-10-23 15:19:55 UTC

Setting UpcomingSprint as unable to resolve before EOD

Comment 8 Anping Li 2021-02-01 15:04:18 UTC

The fix is not in the package openshift-ansible-3.11.380-1.git.0.983c5d1.el7.noarch

Comment 10 ewolinet 2021-02-03 21:18:37 UTC

It looks like this fix didn't make it into 3.11.380-1, but it should make it into 3.11.381-1 when that is released.

Comment 12 Anping Li 2021-02-20 12:59:30 UTC

Created attachment 1758430
The inventory and playbook logs

openshift-ansible-3.11.391-1.git.0.aa2204f.el7.noarch

Comment 15 Anping Li 2021-02-24 05:50:41 UTC

Verified on openshift-ansible-roles-3.11.394-6.git.0.47ec25d.el7.noarch

Comment 17 errata-xmlrpc 2021-03-03 12:27:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 3.11.394 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0637
