Bug 1841832 - EO operator can't talk to itself after 4.4 -> 4.5 upgrade
Summary: EO operator can't talk to itself after 4.4 -> 4.5 upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard: needsqa
Depends On:
Blocks: 1843715
 
Reported: 2020-05-29 15:25 UTC by IgorKarpukhin
Modified: 2020-10-27 16:02 UTC
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:01:56 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/elasticsearch-operator pull 367 (closed): Bug 1841832: network policy to allow ES pod loopback and intercluster communication (last updated 2020-12-02 15:09:35 UTC)
Github openshift/elasticsearch-operator pull 383 (closed): Bug 1841832: Removing calls to enforce and relax networkpolicy (last updated 2020-12-02 15:09:36 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:02:28 UTC)

Description IgorKarpukhin 2020-05-29 15:25:00 UTC
Description of problem:
The elasticsearch-operator (EO) cannot reach the Elasticsearch service:
time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"



Version-Release number of selected component (if applicable): 4.5


How reproducible: 
Upgrade the 4.4 CLO and EO with a CLO instance (ClusterLogging CR) already created.


Steps to Reproduce:
1. Install 4.4 CLO and EO
2. Create CLO's CR
3. Upgrade the EO to 4.5 (see the sketch below)
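
A minimal sketch of step 3 via the OLM Subscription; the Subscription name (elasticsearch-operator) and namespace (openshift-operators-redhat) are assumptions and may differ in your install:

$ # switch the assumed Subscription to the 4.5 channel
$ oc patch subscription elasticsearch-operator -n openshift-operators-redhat \
    --type merge -p '{"spec":{"channel":"4.5"}}'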


Actual results:
Elasticsearch cluster can't start


Expected results:
Elasticsearch cluster upgraded, up and running

Additional info:

Comment 1 IgorKarpukhin 2020-05-29 15:27:47 UTC
EO LOGS:

time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"
time="2020-05-29T13:29:43Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:29:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-1: Node elasticsearch-cdm-tudtgwxx-1 has not rejoined cluster elasticsearch yet"
time="2020-05-29T13:30:13Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-2:  / [yellow green]"
time="2020-05-29T13:30:13Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-2: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:30:43Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-3:  / [yellow green]"
time="2020-05-29T13:30:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-3: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:33:45Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:36:47Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:39:49Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:42:51Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:45:54Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:48:56Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:51:58Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"


OPENSHIFT-LOGGING NAMESPACE STATUS:
cluster-logging-operator-7d7bbc88f5-vvv75       1/1       Running     0          72m
curator-1590759000-7jkdq                        1/1       Running     0          56m
curator-1590760200-9pfq2                        1/1       Running     0          36m
curator-1590762000-fm7t7                        0/1       Completed   0          6m34s
elasticsearch-cdm-tudtgwxx-1-85d69b58fc-cc5dg   1/2       Running     0          3m5s
elasticsearch-cdm-tudtgwxx-2-b888c4d44-djvnf    1/2       Running     0          2m22s
elasticsearch-cdm-tudtgwxx-3-69db67bd47-8hdpt   1/2       Running     0          34s
fluentd-7mblm                                   1/1       Running     0          71m
fluentd-82vkv                                   1/1       Running     0          71m
fluentd-qvkkd                                   1/1       Running     0          71m
fluentd-smhcc                                   1/1       Running     0          71m
fluentd-tfjqj                                   1/1       Running     0          71m
fluentd-zcwl7                                   1/1       Running     0          71m
kibana-7676965bcf-dn65g                         2/2       Running     0          71m

Comment 2 Anping Li 2020-06-03 04:23:33 UTC
After I deleted all the ES pods, the new ES pods came up and reached Running. What is the root cause?

cluster-logging-operator-98f5c5fd-4pw4x         1/1     Running     0          20m
curator-1591153800-zb8lb                        0/1     Completed   0          66m
curator-1591157400-6gdbs                        0/1     Error       0          6m45s
elasticsearch-cdm-knaloezd-1-7bbdf76f85-z2pqv   2/2     Running     0          22m
elasticsearch-cdm-knaloezd-2-5d9d75f6fb-qs7gc   2/2     Running     0          22m
elasticsearch-cdm-knaloezd-3-86dd567d7-b2vlz    2/2     Running     0          22m
elasticsearch-delete-app-1591157700-4gmcl       0/1     Completed   0          112s
elasticsearch-delete-audit-1591157700-tm2lh     0/1     Completed   0          112s
elasticsearch-delete-infra-1591157700-jlhmg     0/1     Completed   0          112s
elasticsearch-rollover-app-1591157700-5l9rh     0/1     Error       0          112s
elasticsearch-rollover-audit-1591157700-sm7rm   0/1     Error       0          112s
elasticsearch-rollover-infra-1591157700-78mpt   0/1     Error       0          112s
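
A minimal sketch of the pod deletion used as the workaround above, assuming the ES pods carry the component=elasticsearch label (the same selector used in the NetworkPolicy in comment 3):

$ # delete all ES pods so their deployments recreate them
$ oc delete pod -n openshift-logging -l component=elasticsearch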

Comment 3 Anping Li 2020-06-03 09:06:19 UTC
This NetworkPolicy works for me.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: restricted-es-policy
      namespace: openshift-logging
    spec:
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              openshift.io/cluster-logging: "true"
          podSelector:
            matchLabels:
              name: elasticsearch-operator
        - podSelector:
            matchLabels:
              component: elasticsearch
      podSelector:
        matchLabels:
          component: elasticsearch
      policyTypes:
      - Ingress
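
To try it, save the manifest to a file and apply it (the file name is illustrative):

$ oc apply -f restricted-es-policy.yaml
$ oc get networkpolicy restricted-es-policy -n openshift-logging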

Comment 6 Anping Li 2020-06-04 10:58:26 UTC
The EO traffic is still blocked with image registry.svc.ci.openshift.org/origin/4.6:elasticsearch-operator.


$ oc logs elasticsearch-operator-8658774bd6-7rrgn -f
{"level":"info","ts":1591267087.5367284,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"proxyconfig-controller","worker count":1}
time="2020-06-04T10:38:18Z" level=warning msg="when trying to perform full cluster restart: Unable to set shard allocation to primaries: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: net/http: TLS handshake timeout"
time="2020-06-04T10:40:18Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:42:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:44:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"

$ oc get networkpolicy restricted-es-policy -o json |jq '.spec'
{
  "ingress": [
    {
      "from": [
        {
          "namespaceSelector": {},
          "podSelector": {
            "matchLabels": {
              "name": "elasticsearch-operator"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9300,
          "protocol": "TCP"
        }
      ]
    }
  ],
  "podSelector": {
    "matchLabels": {
      "component": "elasticsearch"
    }
  },
  "policyTypes": [
    "Ingress"
  ]
}
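
One way to check whether the EO pod actually matches the name=elasticsearch-operator podSelector in the first ingress rule (a diagnostic sketch; the openshift-operators-redhat namespace is an assumption and may differ):

$ oc get pods -n openshift-operators-redhat -l name=elasticsearch-operator --show-labels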

Comment 7 Anping Li 2020-06-04 11:26:40 UTC
After I applied the CR from https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3, it sometimes works, but not always; the NetworkPolicy does not seem to take effect reliably when the policy is changed.

Comment 9 Anping Li 2020-06-09 08:45:39 UTC
Verified: the EO and CLO can be upgraded, and everything works well after the upgrade.

Comment 11 errata-xmlrpc 2020-10-27 16:01:56 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

