Bug 1879407
| Field | Value |
|---|---|
| Summary | The restart-cluster playbook doesn't take into account that openshift_logging_es_ops_cluster_size could be different from openshift_logging_es_cluster_size |
| Product | OpenShift Container Platform |
| Reporter | Andy Bartlett <andbartl> |
| Component | Logging |
| Assignee | ewolinet |
| Status | CLOSED ERRATA |
| QA Contact | Anping Li <anli> |
| Severity | medium |
| Priority | unspecified |
| Version | 3.11.0 |
| CC | aos-bugs, ewolinet, periklis |
| Target Release | 3.11.z |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | logging-exploration |
| Doc Type | Bug Fix |
| Last Closed | 2021-03-03 12:27:45 UTC |
| Type | Bug |

Doc Text:
Cause: The restart task did not use the defined cluster size for ops clusters.
Consequence: The restart would never complete.
Fix: Pass the logging ops cluster size to the restart task.
Result: Restarts of ops clusters complete as expected.
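The Doc Text is terse, so here is a hedged illustration of what "pass the logging ops cluster size" could mean in practice: the play that includes restart_cluster.yml hands it the size matching the component being restarted, instead of the task hard-coding openshift_logging_es_cluster_size. The include site and the _cluster_size variable name are assumptions for illustration, not necessarily the merged patch:

~~~
# Hypothetical sketch: the include site and the _cluster_size name are
# illustrative; the merged fix may differ in detail.
- include_tasks: restart_cluster.yml
  vars:
    _cluster_component: es
    _cluster_size: "{{ openshift_logging_es_cluster_size }}"

- include_tasks: restart_cluster.yml
  vars:
    _cluster_component: es-ops
    _cluster_size: "{{ openshift_logging_es_ops_cluster_size }}"
~~~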
Setting UpcomingSprint as unable to resolve before EOD.

The fix is not in the package openshift-ansible-3.11.380-1.git.0.983c5d1.el7.noarch.

It looks like this fix didn't make it into 3.11.380-1 but should make it into 3.11.381-1 when it is released.

Created attachment 1758430 [details]
The inventory and playbook logs

openshift-ansible-3.11.391-1.git.0.aa2204f.el7.noarch

Verified on openshift-ansible-roles-3.11.394-6.git.0.47ec25d.el7.noarch.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 3.11.394 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0637
Description of problem:
We are trying to upgrade an OpenShift 3.11 cluster to version 219 (from 154). When upgrading the EFK stack, the playbook fails when trying to restart the logging-es-ops cluster. This is because the openshift_logging_es_ops_cluster_size var is not used when getting the running pods in the cluster.

We are running this playbook:
/usr/share/ansible/openshift-ansible/playbooks/openshift-logging/config.yml
with openshift_logging_es_ops_cluster_size set to 3 and openshift_logging_es_cluster_size set to 5.

As you can see in the code, the task uses the openshift_logging_es_cluster_size var in the until clause:

~~~
## get all pods for the cluster
- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == openshift_logging_es_cluster_size
~~~

https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.11.286-1/roles/openshift_logging_elasticsearch/tasks/restart_cluster.yml#L15

As there are 3 logging-es-ops nodes instead of 5, this check fails.

Version-Release number of selected component (if applicable):
OCP 3.11

How reproducible:
100%

Steps to Reproduce:
1. Customer has hit this issue every time with different clusters
2.
3.

Actual results:
Upgrade of the logging stack fails

Expected results:
Ideally the playbook would check whether openshift_logging_es_ops_cluster_size is set for the logging-ops stack and use that variable instead when restarting the ops cluster.

Additional info:
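As additional context, here is a minimal sketch of how the until clause in restart_cluster.yml could honor the ops size, following the expected results above. It assumes a per-component fact with the hypothetical name _cluster_size; this is an illustration, not the actual merged change:

~~~
# Sketch under assumptions: _cluster_size is a hypothetical per-component
# fact; set_fact stores strings, hence the | int cast in the comparison.
- set_fact:
    _cluster_size: "{{ openshift_logging_es_ops_cluster_size
                       if _cluster_component == 'es-ops'
                       else openshift_logging_es_cluster_size }}"

## get all pods for the cluster
- command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get pod -l component={{ _cluster_component }},provider=openshift
    -n {{ openshift_logging_elasticsearch_namespace }}
    -o jsonpath={.items[?(@.status.phase==\"Running\")].metadata.name}
  register: _cluster_pods
  retries: "{{ __elasticsearch_ready_retries }}"
  delay: 5
  until:
  - _cluster_pods.stdout is defined
  - _cluster_pods.stdout == "" or _cluster_pods.stdout.split(' ') | count == _cluster_size | int
~~~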