Bug 1889211 - Logging playbook is failing at TASK "Enable shard balancing for logging-{{ _cluster_component }} cluster"
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
Whiteboard: logging-core
Depends On:
Reported: 2020-10-19 05:08 UTC by Apurva Nisal
Modified: 2020-11-09 13:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2020-11-09 13:28:00 UTC
Target Upstream Version:

Description Apurva Nisal 2020-10-19 05:08:52 UTC
Description of problem:

While upgrading the cluster from OCP 3.11.272 to OCP 3.11.286, the upgrade playbook fails at the handler shown below.
(openshift-logging/config.yml, run afterwards, fails at the same handler.)

RUNNING HANDLER [openshift_logging_elasticsearch : Enable shard balancing for logging-{{ _cluster_component }} cluster] ******************************************************************************************
fatal: [lxosm101p.vkbads.de]: FAILED! => {"changed": false, "cmd": ["curl", "-s", "-k", "--cert", "/tmp/openshift-logging-ansible-364jHA/admin-cert", "--key", "/tmp/openshift-logging-ansible-364jHA/admin-key", "-XPUT", "https://logging-es.openshift-logging.svc:9200/_cluster/settings", "-d", "{ \"persistent\": { \"cluster.routing.allocation.enable\" : \"all\" } }"], "delta": "0:00:01.014533", "end": "2020-09-21 19:31:46.576222", "msg": "non-zero return code", "rc": 7, "start": "2020-09-21 19:31:45.561689", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

Version-Release number of selected component (if applicable):
OCP 3.11.286 
Ansible  2.9.13

How reproducible:
Reproducible in the customer's cluster

Steps to Reproduce:

Actual results:
Playbook fails at TASK "Enable shard balancing for logging-{{ _cluster_component }} cluster"

Expected results:
Playbook finishes successfully

Additional info:

A] Logging stack details:

1) Number of es pods : 1

2) Shards:
    - name: PRIMARY_SHARDS
      value: "1"
    - name: REPLICA_SHARDS
      value: "0"

B] State of the logging stack after the playbook fails at TASK "Enable shard balancing for logging-{{ _cluster_component }} cluster":

1) All pods are running.
2) All logging-stack images have been updated to OCP 3.11.286
3) ES is healthy and green
4) Output of "es_util --query=_cluster/settings?pretty "
oc exec logging-es-data-master-5dp657y8-25-fgq4j -- es_util --query=_cluster/settings?pretty
Defaulting container name to elasticsearch.
Use 'oc describe pod/logging-es-data-master-5dp657y8-25-fgq4j -n openshift-logging' to see all of the containers in this pod.
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "primaries"
  "transient" : { }
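
The settings output above shows shard allocation still restricted to "primaries", which is consistent with the handler's PUT request never reaching ES. As a rough sketch only (the function names are hypothetical and not part of openshift-ansible), the check and the request body that the failing curl command sends could be expressed in Python as follows. It assumes the nested, non-flat response layout shown above, that transient settings take precedence over persistent ones, and that "all" is the Elasticsearch default when nothing is set.

```python
import json

# Path of the setting inside a nested (non-flat) _cluster/settings response.
SETTING_PATH = ("cluster", "routing", "allocation", "enable")


def allocation_mode(settings):
    """Return the effective cluster.routing.allocation.enable value.

    Transient settings are checked before persistent ones; if neither
    scope sets the value, the Elasticsearch default "all" is returned.
    """
    for scope in ("transient", "persistent"):
        node = settings.get(scope, {})
        for key in SETTING_PATH:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if node is not None:
            return node
    return "all"


def enable_payload():
    """Build the same JSON body that the failing handler passes to curl -d."""
    return json.dumps(
        {"persistent": {"cluster.routing.allocation.enable": "all"}}
    )


# Settings as reported in this bug after the playbook failure.
observed = {
    "persistent": {
        "cluster": {"routing": {"allocation": {"enable": "primaries"}}}
    },
    "transient": {},
}

if allocation_mode(observed) != "all":
    print("shard balancing is still restricted; body to PUT:", enable_payload())
```

If the mode is anything other than "all", the body from enable_payload() is what would need to be sent to _cluster/settings once ES is reachable again.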

Comment 3 Jeff Cantrill 2020-10-19 16:03:41 UTC
RC 7 means curl was unable to connect to ES, which suggests the pod may not have been running. Nothing in the logs shows whether there is even an ES cluster to upgrade. Please provide additional info [1]. Note that a single-node ES cluster cannot support even a small OCP cluster, so you will likely need to provide additional nodes and resources.

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/hack/logging-dump.sh
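
For triage, the return code in the Ansible failure record can be read against curl's documented exit codes. The sketch below is illustrative only: the helper name and the short selection of codes are assumptions made for this example, and the record is an abbreviated copy of the fields from the failure in the description.

```python
# A few curl exit codes, as documented in the curl man page.
CURL_EXIT_CODES = {
    6: "could not resolve host",
    7: "failed to connect to host",
    28: "operation timed out",
    35: "SSL connect error",
    60: "peer certificate cannot be authenticated with known CA certificates",
}


def explain_curl_failure(result):
    """Summarise an Ansible command result whose cmd was a curl call.

    `result` is the dict Ansible prints after "FAILED! =>"; only the
    "rc" and "stderr" fields are used here.
    """
    rc = result.get("rc")
    reason = CURL_EXIT_CODES.get(rc, "unrecognised exit code")
    stderr = result.get("stderr") or "(empty, curl was run with -s)"
    return "curl rc={}: {}; stderr: {}".format(rc, reason, stderr)


# Abbreviated copy of the failure record from the description.
failure = {"changed": False, "msg": "non-zero return code", "rc": 7, "stderr": ""}
print(explain_curl_failure(failure))
```

Because the handler runs curl with -s, stderr is empty and the return code is the only signal available, so the output of logging-dump.sh is still needed to see the state of the ES pod.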

Comment 6 Jeff Cantrill 2020-10-23 15:20:28 UTC
Setting UpcomingSprint as unable to resolve before EOD

Comment 8 Periklis Tsirakidis 2020-11-09 12:39:38 UTC

Is this issue resolved? Can we close this one?

Comment 10 Periklis Tsirakidis 2020-11-09 13:28:00 UTC
I cannot tell what the changes are; please consult the errata. We will close this issue, and you can re-open it if this happens again.
