Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1841832

Summary: EO operator can't talk to itself after 4.4 -> 4.5 upgrade
Product: OpenShift Container Platform
Component: Logging
Version: 4.5
Target Release: 4.6.0
Reporter: IgorKarpukhin <ikarpukh>
Assignee: ewolinet
QA Contact: Anping Li <anli>
CC: aos-bugs, jcantril
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: TestBlocker
Whiteboard: needsqa
Doc Type: No Doc Update
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-10-27 16:01:56 UTC
Bug Blocks: 1843715

Description IgorKarpukhin 2020-05-29 15:25:00 UTC
Description of problem:
The EO can't reach the Elasticsearch service:
time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"



Version-Release number of selected component (if applicable): 4.5


How reproducible:
Upgrade the 4.4 CLO and EO while a CLO instance (ClusterLogging CR) exists.


Steps to Reproduce:
1. Install 4.4 CLO and EO
2. Create CLO's CR
3. Upgrade EO to 4.5


Actual results:
Elasticsearch cluster can't start


Expected results:
Elasticsearch cluster upgraded, up and running

Additional info:

Comment 1 IgorKarpukhin 2020-05-29 15:27:47 UTC
EO LOGS:

time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"
time="2020-05-29T13:29:43Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:29:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-1: Node elasticsearch-cdm-tudtgwxx-1 has not rejoined cluster elasticsearch yet"
time="2020-05-29T13:30:13Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-2:  / [yellow green]"
time="2020-05-29T13:30:13Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-2: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:30:43Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-3:  / [yellow green]"
time="2020-05-29T13:30:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-3: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:33:45Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:36:47Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:39:49Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:42:51Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:45:54Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:48:56Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:51:58Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"


OPENSHIFT-LOGGING NAMESPACE STATUS:
cluster-logging-operator-7d7bbc88f5-vvv75       1/1       Running     0          72m
curator-1590759000-7jkdq                        1/1       Running     0          56m
curator-1590760200-9pfq2                        1/1       Running     0          36m
curator-1590762000-fm7t7                        0/1       Completed   0          6m34s
elasticsearch-cdm-tudtgwxx-1-85d69b58fc-cc5dg   1/2       Running     0          3m5s
elasticsearch-cdm-tudtgwxx-2-b888c4d44-djvnf    1/2       Running     0          2m22s
elasticsearch-cdm-tudtgwxx-3-69db67bd47-8hdpt   1/2       Running     0          34s
fluentd-7mblm                                   1/1       Running     0          71m
fluentd-82vkv                                   1/1       Running     0          71m
fluentd-qvkkd                                   1/1       Running     0          71m
fluentd-smhcc                                   1/1       Running     0          71m
fluentd-tfjqj                                   1/1       Running     0          71m
fluentd-zcwl7                                   1/1       Running     0          71m
kibana-7676965bcf-dn65g                         2/2       Running     0          71m

Comment 2 Anping Li 2020-06-03 04:23:33 UTC
After I deleted all ES pods, the new ES pods started running. What is the root cause?

cluster-logging-operator-98f5c5fd-4pw4x         1/1     Running     0          20m
curator-1591153800-zb8lb                        0/1     Completed   0          66m
curator-1591157400-6gdbs                        0/1     Error       0          6m45s
elasticsearch-cdm-knaloezd-1-7bbdf76f85-z2pqv   2/2     Running     0          22m
elasticsearch-cdm-knaloezd-2-5d9d75f6fb-qs7gc   2/2     Running     0          22m
elasticsearch-cdm-knaloezd-3-86dd567d7-b2vlz    2/2     Running     0          22m
elasticsearch-delete-app-1591157700-4gmcl       0/1     Completed   0          112s
elasticsearch-delete-audit-1591157700-tm2lh     0/1     Completed   0          112s
elasticsearch-delete-infra-1591157700-jlhmg     0/1     Completed   0          112s
elasticsearch-rollover-app-1591157700-5l9rh     0/1     Error       0          112s
elasticsearch-rollover-audit-1591157700-sm7rm   0/1     Error       0          112s
elasticsearch-rollover-infra-1591157700-78mpt   0/1     Error       0          112s

Comment 3 Anping Li 2020-06-03 09:06:19 UTC
This network policy works for me.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: restricted-es-policy
      namespace: openshift-logging
    spec:
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              openshift.io/cluster-logging: "true"
          podSelector:
            matchLabels:
              name: elasticsearch-operator
        - podSelector:
            matchLabels:
              component: elasticsearch
      podSelector:
        matchLabels:
          component: elasticsearch
      policyTypes:
      - Ingress
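
For reference, a minimal sketch of how such a policy could be applied and spot-checked. The filename is an assumption, the pod name is taken from the listing in comment 2 and will differ per cluster, and the probe assumes curl is available in the elasticsearch container image:

```shell
# Assumption: the NetworkPolicy above has been saved as restricted-es-policy.yaml.
oc apply -f restricted-es-policy.yaml

# Confirm the policy was admitted and inspect its rendered rules.
oc -n openshift-logging get networkpolicy restricted-es-policy -o yaml

# Optional spot check: probe the ES service from inside the namespace
# (pod name from comment 2; adjust to your cluster).
oc -n openshift-logging exec elasticsearch-cdm-knaloezd-1-7bbdf76f85-z2pqv \
  -c elasticsearch -- \
  curl -sk --max-time 5 https://elasticsearch.openshift-logging.svc:9200/_cat/health
```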

Comment 6 Anping Li 2020-06-04 10:58:26 UTC
The EO traffic is still blocked. image: registry.svc.ci.openshift.org/origin/4.6:elasticsearch-operator


$ oc logs elasticsearch-operator-8658774bd6-7rrgn -f
{"level":"info","ts":1591267087.5367284,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"proxyconfig-controller","worker count":1}
time="2020-06-04T10:38:18Z" level=warning msg="when trying to perform full cluster restart: Unable to set shard allocation to primaries: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: net/http: TLS handshake timeout"
time="2020-06-04T10:40:18Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:42:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:44:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"

$ oc get networkpolicy restricted-es-policy -o json |jq '.spec'
{
  "ingress": [
    {
      "from": [
        {
          "namespaceSelector": {},
          "podSelector": {
            "matchLabels": {
              "name": "elasticsearch-operator"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9300,
          "protocol": "TCP"
        }
      ]
    }
  ],
  "podSelector": {
    "matchLabels": {
      "component": "elasticsearch"
    }
  },
  "policyTypes": [
    "Ingress"
  ]
}
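
A quick way to separate a NetworkPolicy drop from a TLS or auth problem is to probe the service directly from the operator pod. A hedged sketch: the pod name is taken from the `oc logs` command above, and it assumes curl exists in the operator image:

```shell
# Probe the ES service from the EO pod. A connect timeout here (as in the
# "dial tcp ... i/o timeout" log lines) would point at the NetworkPolicy;
# an immediate TLS or 401 error would mean the connection itself is allowed.
oc exec elasticsearch-operator-8658774bd6-7rrgn -- \
  curl -skv --max-time 5 https://elasticsearch.openshift-logging.svc:9200/
```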

Comment 7 Anping Li 2020-06-04 11:26:40 UTC
After I applied the NetworkPolicy in https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3, it sometimes works, but not always; the NetworkPolicy doesn't seem to take effect reliably when the policy is changed.

Comment 9 Anping Li 2020-06-09 08:45:39 UTC
Verified: the EO and CLO can be upgraded, and everything works well after the upgrade.

Comment 11 errata-xmlrpc 2020-10-27 16:01:56 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196