Bug 1843715 - EO operator can't talk to itself after 4.4 -> 4.5 upgrade
Summary: EO operator can't talk to itself after 4.4 -> 4.5 upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On: 1841832
Blocks:
 
Reported: 2020-06-03 22:25 UTC by OpenShift BugZilla Robot
Modified: 2020-07-13 17:43 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:43:10 UTC
Target Upstream Version:
Embargoed:


Links
Github openshift/elasticsearch-operator pull 386 (open): [release-4.5] Bug 1843715: Perform a full cluster restart when upgrading ES maj versions (last updated 2020-06-24 03:33:02 UTC)
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:43:28 UTC)

Description OpenShift BugZilla Robot 2020-06-03 22:25:42 UTC
+++ This bug was initially created as a clone of Bug #1841832 +++

Description of problem:
The elasticsearch-operator (EO) cannot reach the Elasticsearch service:
time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"



Version-Release number of selected component (if applicable): 4.5


How reproducible: 
Upgrade the 4.4 CLO and EO with the CLO's instance created.


Steps to Reproduce:
1. Install 4.4 CLO and EO
2. Create the CLO's ClusterLogging CR (a minimal sketch follows below)
3. Upgrade EO to 4.5
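
For reference, a minimal ClusterLogging CR for step 2 might look like the sketch below. This is an illustrative example rather than the reporter's exact CR; the node count, redundancy policy, storage, and curator schedule are assumptions, chosen only to be consistent with the elasticsearch-cdm-*, curator, kibana, and fluentd pods shown later in this report.

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    metadata:
      name: instance
      namespace: openshift-logging
    spec:
      managementState: Managed
      logStore:
        type: elasticsearch
        elasticsearch:
          # three combined data/master nodes, matching the elasticsearch-cdm-* pods in this report
          nodeCount: 3
          redundancyPolicy: SingleRedundancy
          # empty storage block means emptyDir storage; assumption for this sketch
          storage: {}
      visualization:
        type: kibana
        kibana:
          replicas: 1
      curation:
        type: curator
        curator:
          # assumed curator schedule for the sketch
          schedule: "*/30 * * * *"
      collection:
        logs:
          type: fluentd
          fluentd: {}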


Actual results:
Elasticsearch cluster can't start


Expected results:
Elasticsearch cluster upgraded, up and running

Additional info:

--- Additional comment from ikarpukh on 2020-05-29 15:27:47 UTC ---

EO LOGS:

time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"
time="2020-05-29T13:29:43Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:29:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-1: Node elasticsearch-cdm-tudtgwxx-1 has not rejoined cluster elasticsearch yet"
time="2020-05-29T13:30:13Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-2:  / [yellow green]"
time="2020-05-29T13:30:13Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-2: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:30:43Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-3:  / [yellow green]"
time="2020-05-29T13:30:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-3: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:33:45Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:36:47Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:39:49Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:42:51Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:45:54Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:48:56Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:51:58Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"


OPENSHIFT-LOGGING NAMESPACE STATUS:
cluster-logging-operator-7d7bbc88f5-vvv75       1/1       Running     0          72m
curator-1590759000-7jkdq                        1/1       Running     0          56m
curator-1590760200-9pfq2                        1/1       Running     0          36m
curator-1590762000-fm7t7                        0/1       Completed   0          6m34s
elasticsearch-cdm-tudtgwxx-1-85d69b58fc-cc5dg   1/2       Running     0          3m5s
elasticsearch-cdm-tudtgwxx-2-b888c4d44-djvnf    1/2       Running     0          2m22s
elasticsearch-cdm-tudtgwxx-3-69db67bd47-8hdpt   1/2       Running     0          34s
fluentd-7mblm                                   1/1       Running     0          71m
fluentd-82vkv                                   1/1       Running     0          71m
fluentd-qvkkd                                   1/1       Running     0          71m
fluentd-smhcc                                   1/1       Running     0          71m
fluentd-tfjqj                                   1/1       Running     0          71m
fluentd-zcwl7                                   1/1       Running     0          71m
kibana-7676965bcf-dn65g                         2/2       Running     0          71m

--- Additional comment from anli on 2020-06-03 04:23:33 UTC ---

After I deleted all the ES pods, the new ES pods started running. What is the root cause?

cluster-logging-operator-98f5c5fd-4pw4x         1/1     Running     0          20m
curator-1591153800-zb8lb                        0/1     Completed   0          66m
curator-1591157400-6gdbs                        0/1     Error       0          6m45s
elasticsearch-cdm-knaloezd-1-7bbdf76f85-z2pqv   2/2     Running     0          22m
elasticsearch-cdm-knaloezd-2-5d9d75f6fb-qs7gc   2/2     Running     0          22m
elasticsearch-cdm-knaloezd-3-86dd567d7-b2vlz    2/2     Running     0          22m
elasticsearch-delete-app-1591157700-4gmcl       0/1     Completed   0          112s
elasticsearch-delete-audit-1591157700-tm2lh     0/1     Completed   0          112s
elasticsearch-delete-infra-1591157700-jlhmg     0/1     Completed   0          112s
elasticsearch-rollover-app-1591157700-5l9rh     0/1     Error       0          112s
elasticsearch-rollover-audit-1591157700-sm7rm   0/1     Error       0          112s
elasticsearch-rollover-infra-1591157700-78mpt   0/1     Error       0          112s

--- Additional comment from anli on 2020-06-03 09:06:19 UTC ---

This network policy works for me:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: restricted-es-policy
      namespace: openshift-logging
    spec:
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              openshift.io/cluster-logging: "true"
          podSelector:
            matchLabels:
              name: elasticsearch-operator
        - podSelector:
            matchLabels:
              component: elasticsearch
      podSelector:
        matchLabels:
          component: elasticsearch
      policyTypes:
      - Ingress
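
As written, this policy permits ingress to the Elasticsearch pods (spec.podSelector component=elasticsearch) only from pods labeled name=elasticsearch-operator in namespaces labeled openshift.io/cluster-logging=true, and from the Elasticsearch pods themselves. Since the namespace is set in metadata, it can be applied directly with "oc apply -f <file>". This is the commenter's workaround; the linked pull request above addresses the upgrade itself by performing a full cluster restart when upgrading ES major versions.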

Comment 3 Anping Li 2020-06-10 06:21:07 UTC
Verified on elasticsearch-operator.4.5.0-202006091957

Comment 4 errata-xmlrpc 2020-07-13 17:43:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

