Bug 1835046
| Summary: | Logging 4.4 shardAllocationEnabled is `none` after upgrade OCP cluster from 4.4 to 4.5. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qiaoling Tang <qitang> |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.5 | CC: | aos-bugs, ewolinet, periklis, ssadhale |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-13 17:38:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Qiaoling Tang, 2020-05-13 01:46:04 UTC)
@Periklis, I think that is different. For example, the status is Yellow even when cluster.routing.allocation.enable=all (https://bugzilla.redhat.com/show_bug.cgi?id=1838153):

    # oc rsh elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4
    sh-4.2$ es_cluster_health
    {
      "cluster_name" : "elasticsearch",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 16,
      "active_shards" : 16,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 10,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 61.53846153846154
    }

    sh-4.2$ es_util --query=_cluster/settings
    {"persistent":{"discovery":{"zen":{"minimum_master_nodes":"1"}}},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}

    sh-4.2$ es_util --query=_cat/shards
    .kibana 0 p STARTED 1 3.2kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    project.logjsonx.9b993d19-0818-4232-8940-cf06a750e965.2020.05.23 0 p STARTED 746 485.2kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 1 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 1 r UNASSIGNED
    infra-write 4 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 4 r UNASSIGNED
    infra-write 2 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 2 r UNASSIGNED
    infra-write 3 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 3 r UNASSIGNED
    infra-write 0 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    infra-write 0 r UNASSIGNED
    app-write 1 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 1 r UNASSIGNED
    app-write 4 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 4 r UNASSIGNED
    app-write 2 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 2 r UNASSIGNED
    app-write 3 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 3 r UNASSIGNED
    app-write 0 p STARTED 0 162b 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    app-write 0 r UNASSIGNED
    project.logflatx.b1c056f0-4405-45bd-8cea-76338862d9ed.2020.05.23 0 p STARTED 746 542.9kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    .searchguard 0 p STARTED 5 145.3kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    .kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac 0 p STARTED 4 64.1kb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
    .operations.2020.05.23 0 p STARTED 31341 30.9mb 10.129.2.58 elasticsearch-cdm-xv9zo8gz-1
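As a side note on the diagnostics above: the transient allocation flag can be inspected, and reset if a cluster is left stuck at `none`, through the same `es_util` wrapper. This is only a sketch, assuming `es_util` forwards extra arguments to curl as in the 4.x images; the pod name is simply reused from the output above.

```sh
# Inspect the current cluster-level allocation setting.
oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4 -- \
  es_util --query=_cluster/settings?pretty

# Re-enable allocation for all shards if the transient setting is stuck at "none".
oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4 -- \
  es_util --query=_cluster/settings -X PUT \
  -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
```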
Based on your EO logs, it seems the pods were originally unable to be deployed (by the scheduler), so two of them bypassed the normal upgrade path:

    level=info msg="Requested to update node 'elasticsearch-cdm-m2j2lxw9-3', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
    level=info msg="Requested to update node 'elasticsearch-cdm-m2j2lxw9-2', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"

Then we see we timed out waiting on the second one to roll out, but it eventually succeeded and we moved on to a normal upgrade of the last node:

    level=info msg="Timed out waiting for node elasticsearch-cdm-m2j2lxw9-2 to rollout"
    level=warning msg="Failed to progress update of unschedulable node 'elasticsearch-cdm-m2j2lxw9-2': timed out waiting for the condition"
    level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-m2j2lxw9-1: red / green"

The odd thing is that the bypassing logic does not change the shard allocation for the cluster, so it is possible the setting was left over from a prior upgrade? Also, looking at the elasticsearch CR, only one of the nodes is noted as upgraded in the status. I'll see if I can recreate this.

Per my understanding, the pods were originally unable to be deployed because the upgrade from 4.4 to 4.5 upgrades every node, and the nodes are marked unschedulable while they are being upgraded. I had checked the `shardAllocationEnabled` status before the cluster was upgraded; it was `all`, and everything seemed to work well.

Can you retest this? We have since updated the way we do our upgrades to not use the shard allocation of "none", per https://github.com/openshift/elasticsearch-operator/pull/355

It seems transient.cluster.routing.allocation.enable is `none` by default. If there are new indices, the CLO will change it to `all` momentarily. Moving to verified: Logging can be upgraded even when transient.cluster.routing.allocation.enable is `none`, and on 4.5 transient.cluster.routing.allocation.enable is `all`.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
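For anyone re-verifying this on 4.5, a check along the lines QA describes might look like the following sketch. It assumes the status field is exposed as `.status.shardAllocationEnabled` on the `elasticsearch` CR in `openshift-logging` (as the comments above suggest) and reuses the pod name from the earlier output.

```sh
# Check what the Elasticsearch CR reports for shard allocation after the upgrade.
oc get elasticsearch elasticsearch -n openshift-logging \
  -o jsonpath='{.status.shardAllocationEnabled}{"\n"}'

# Cross-check the live transient setting and the cluster health inside the cluster.
oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4 -- \
  es_util --query=_cluster/settings?pretty
oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-xv9zo8gz-1-cbbd47549-5ksk4 -- \
  es_util --query=_cluster/health?pretty
```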