Bug 1879407

Summary: The restart-cluster playbook doesn't take into account that openshift_logging_es_ops_cluster_size could be different from openshift_logging_es_cluster_size
Product: OpenShift Container Platform
Reporter: Andy Bartlett <andbartl>
Component: Logging
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, ewolinet, periklis
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-exploration
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The restart task did not use the defined cluster size for ops clusters.
Consequence: The restart would never complete.
Fix: Pass the logging ops cluster size to the restart task.
Result: Restarts of ops clusters complete as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-03 12:27:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: The inventory and playbook logs (flags: none)

Description Andy Bartlett 2020-09-16 08:49:42 UTC
Description of problem:

We are trying to upgrade an OpenShift 3.11 cluster from 3.11.154 to 3.11.219. When upgrading the EFK stack, the playbook fails when trying to restart the logging-es-ops cluster. This is because the openshift_logging_es_ops_cluster_size variable is not used when getting the running pods in that cluster.

We are running this playbook:
/usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml

With openshift_logging_es_ops_cluster_size set to 3 and openshift_logging_es_cluster_size set to 5.

As you can see in the code, the task uses the openshift_logging_es_cluster_size var in the until clause: 
~~~
## get all pods for the cluster
- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod
    -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == openshift_logging_es_cluster_size
~~~
https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.11.286-1/roles/openshift_logging_elasticsearch/tasks/restart_cluster.yml#L15

As there are only 3 logging-es-ops nodes instead of 5, this check never succeeds: the until condition compares the 3 running ops pods against openshift_logging_es_cluster_size (5), so the task exhausts its retries and the play fails.
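
For illustration, here is a minimal, self-contained sketch of how that until expression evaluates for the ops cluster. The playbook, the _cluster_pods_stdout variable, and the pod names are made up for the example; only the comparison itself is taken from the task above:
~~~
# Hypothetical stand-alone playbook that evaluates just the comparison.
# With 3 running ops pods but openshift_logging_es_cluster_size set to 5,
# the expression is always false, so the real task keeps retrying until
# __elasticsearch_ready_retries is exhausted and the play fails.
- hosts: localhost
  gather_facts: false
  vars:
    openshift_logging_es_cluster_size: 5
    # jsonpath-style output for the 3 running logging-es-ops pods (made-up names)
    _cluster_pods_stdout: "logging-es-ops-data-master-1 logging-es-ops-data-master-2 logging-es-ops-data-master-3"
  tasks:
  - debug:
      msg: "{{ _cluster_pods_stdout == '' or _cluster_pods_stdout.split(' ') | count == openshift_logging_es_cluster_size }}"
      # prints "False": 3 running pods never equal the configured size of 5
~~~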

Version-Release number of selected component (if applicable):
OCP 3.11

How reproducible:
100%

Steps to Reproduce:
1. Run the openshift-logging config playbook with openshift_logging_es_ops_cluster_size set to a different value than openshift_logging_es_cluster_size; the customer has hit this issue every time, on different clusters.

Actual results:

Upgrade of the logging stack fails.

Expected results:
Ideally the playbook would check whether openshift_logging_es_ops_cluster_size is set for the ops stack and use that variable instead when restarting the ops cluster.
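
For reference, a minimal sketch of the kind of change being requested follows. This is not the actual upstream patch; the _expected_cluster_size fact and the _cluster_component == 'es-ops' test are assumptions made for illustration:
~~~
# Sketch only -- not the upstream fix. Select the expected size based on the
# component being restarted, then compare the running pod count against it.
- set_fact:
    _expected_cluster_size: "{{ openshift_logging_es_ops_cluster_size if _cluster_component == 'es-ops' else openshift_logging_es_cluster_size }}"

- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod
    -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  # cast to int because set_fact templating may turn the size into a string
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == _expected_cluster_size | int
~~~
The fix that shipped (per the Doc Text above) passes the logging ops cluster size into the restart task; the sketch only illustrates where the ops size needs to be consulted.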

Additional info:

Comment 3 Jeff Cantrill 2020-10-23 15:19:55 UTC
Setting UpcomingSprint as unable to resolve before EOD

Comment 8 Anping Li 2021-02-01 15:04:18 UTC
The fix is not in the package openshift-ansible-3.11.380-1.git.0.983c5d1.el7.noarch

Comment 10 ewolinet 2021-02-03 21:18:37 UTC
It looks like this fix didn't make it into 3.11.380-1, but it should make it into 3.11.381-1 when that is released.

Comment 12 Anping Li 2021-02-20 12:59:30 UTC
Created attachment 1758430 [details]
The inventory and playbook logs

openshift-ansible-3.11.391-1.git.0.aa2204f.el7.noarch

Comment 15 Anping Li 2021-02-24 05:50:41 UTC
Verified on openshift-ansible-roles-3.11.394-6.git.0.47ec25d.el7.noarch

Comment 17 errata-xmlrpc 2021-03-03 12:27:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 3.11.394 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0637