Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1841832

Summary: EO operator can't talk to itself after 4.4 -> 4.5 upgrade
Product: OpenShift Container Platform
Component: Logging
Version: 4.5
Target Release: 4.6.0
Reporter: IgorKarpukhin <ikarpukh>
Assignee: ewolinet
QA Contact: Anping Li <anli>
CC: aos-bugs, jcantril
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: TestBlocker
Whiteboard: needsqa
Doc Type: No Doc Update
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-10-27 16:01:56 UTC
Bug Blocks: 1843715

Description IgorKarpukhin 2020-05-29 15:25:00 UTC
Description of problem:
The EO can't reach the Elasticsearch service:
time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"



Version-Release number of selected component (if applicable): 4.5


How reproducible:
Upgrade the 4.4 CLO and EO while a CLO instance (ClusterLogging CR) exists.


Steps to Reproduce:
1. Install 4.4 CLO and EO
2. Create CLO's CR
3. Upgrade EO to 4.5


Actual results:
Elasticsearch cluster can't start


Expected results:
Elasticsearch cluster upgraded, up and running

Additional info:

Comment 1 IgorKarpukhin 2020-05-29 15:27:47 UTC
EO LOGS:

time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"
time="2020-05-29T13:29:43Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:29:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-1: Node elasticsearch-cdm-tudtgwxx-1 has not rejoined cluster elasticsearch yet"
time="2020-05-29T13:30:13Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-2:  / [yellow green]"
time="2020-05-29T13:30:13Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-2: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:30:43Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-3:  / [yellow green]"
time="2020-05-29T13:30:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-3: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:33:45Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:36:47Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:39:49Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:42:51Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:45:54Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:48:56Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:51:58Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"


OPENSHIFT-LOGGING NAMESPACE STATUS:
cluster-logging-operator-7d7bbc88f5-vvv75       1/1       Running     0          72m
curator-1590759000-7jkdq                        1/1       Running     0          56m
curator-1590760200-9pfq2                        1/1       Running     0          36m
curator-1590762000-fm7t7                        0/1       Completed   0          6m34s
elasticsearch-cdm-tudtgwxx-1-85d69b58fc-cc5dg   1/2       Running     0          3m5s
elasticsearch-cdm-tudtgwxx-2-b888c4d44-djvnf    1/2       Running     0          2m22s
elasticsearch-cdm-tudtgwxx-3-69db67bd47-8hdpt   1/2       Running     0          34s
fluentd-7mblm                                   1/1       Running     0          71m
fluentd-82vkv                                   1/1       Running     0          71m
fluentd-qvkkd                                   1/1       Running     0          71m
fluentd-smhcc                                   1/1       Running     0          71m
fluentd-tfjqj                                   1/1       Running     0          71m
fluentd-zcwl7                                   1/1       Running     0          71m
kibana-7676965bcf-dn65g                         2/2       Running     0          71m

Comment 2 Anping Li 2020-06-03 04:23:33 UTC
After I deleted all ES pods, the new ES pods started running. What is the root cause?

cluster-logging-operator-98f5c5fd-4pw4x         1/1     Running     0          20m
curator-1591153800-zb8lb                        0/1     Completed   0          66m
curator-1591157400-6gdbs                        0/1     Error       0          6m45s
elasticsearch-cdm-knaloezd-1-7bbdf76f85-z2pqv   2/2     Running     0          22m
elasticsearch-cdm-knaloezd-2-5d9d75f6fb-qs7gc   2/2     Running     0          22m
elasticsearch-cdm-knaloezd-3-86dd567d7-b2vlz    2/2     Running     0          22m
elasticsearch-delete-app-1591157700-4gmcl       0/1     Completed   0          112s
elasticsearch-delete-audit-1591157700-tm2lh     0/1     Completed   0          112s
elasticsearch-delete-infra-1591157700-jlhmg     0/1     Completed   0          112s
elasticsearch-rollover-app-1591157700-5l9rh     0/1     Error       0          112s
elasticsearch-rollover-audit-1591157700-sm7rm   0/1     Error       0          112s
elasticsearch-rollover-infra-1591157700-78mpt   0/1     Error       0          112s

Comment 3 Anping Li 2020-06-03 09:06:19 UTC
This network policy works for me.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: restricted-es-policy
      namespace: openshift-logging
    spec:
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              openshift.io/cluster-logging: "true"
          podSelector:
            matchLabels:
              name: elasticsearch-operator
        - podSelector:
            matchLabels:
              component: elasticsearch
      podSelector:
        matchLabels:
          component: elasticsearch
      policyTypes:
      - Ingress
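
For reference, a minimal sketch of how such a policy could be applied and spot-checked. The filename is an assumption, the pod name is taken from the listing in comment 2 and will differ per cluster, and the probe assumes curl is available in the elasticsearch container image:

```shell
# Assumption: the NetworkPolicy above has been saved as restricted-es-policy.yaml.
oc apply -f restricted-es-policy.yaml

# Confirm the policy was admitted and inspect its rendered rules.
oc -n openshift-logging get networkpolicy restricted-es-policy -o yaml

# Optional spot check: probe the ES service from inside the namespace
# (pod name from comment 2; adjust to your cluster).
oc -n openshift-logging exec elasticsearch-cdm-knaloezd-1-7bbdf76f85-z2pqv \
  -c elasticsearch -- \
  curl -sk --max-time 5 https://elasticsearch.openshift-logging.svc:9200/_cat/health
```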

Comment 6 Anping Li 2020-06-04 10:58:26 UTC
The EO traffic is still blocked. image: registry.svc.ci.openshift.org/origin/4.6:elasticsearch-operator


$ oc logs elasticsearch-operator-8658774bd6-7rrgn -f
{"level":"info","ts":1591267087.5367284,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"proxyconfig-controller","worker count":1}
time="2020-06-04T10:38:18Z" level=warning msg="when trying to perform full cluster restart: Unable to set shard allocation to primaries: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: net/http: TLS handshake timeout"
time="2020-06-04T10:40:18Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:42:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:44:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"

$ oc get networkpolicy restricted-es-policy -o json |jq '.spec'
{
  "ingress": [
    {
      "from": [
        {
          "namespaceSelector": {},
          "podSelector": {
            "matchLabels": {
              "name": "elasticsearch-operator"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9300,
          "protocol": "TCP"
        }
      ]
    }
  ],
  "podSelector": {
    "matchLabels": {
      "component": "elasticsearch"
    }
  },
  "policyTypes": [
    "Ingress"
  ]
}
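
A quick way to separate a NetworkPolicy drop from a TLS or auth problem is to probe the service directly from the operator pod. A hedged sketch: the pod name is taken from the `oc logs` command above, and it assumes curl exists in the operator image:

```shell
# Probe the ES service from the EO pod. A connect timeout here (as in the
# "dial tcp ... i/o timeout" log lines) would point at the NetworkPolicy;
# an immediate TLS or 401 error would mean the connection itself is allowed.
oc exec elasticsearch-operator-8658774bd6-7rrgn -- \
  curl -skv --max-time 5 https://elasticsearch.openshift-logging.svc:9200/
```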

Comment 7 Anping Li 2020-06-04 11:26:40 UTC
After I applied the NetworkPolicy in https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3, it sometimes works, but not always; the NetworkPolicy doesn't seem to take effect reliably when the policy is changed.

Comment 9 Anping Li 2020-06-09 08:45:39 UTC
Verified: the EO and CLO can be upgraded, and everything works well after the upgrade.

Comment 11 errata-xmlrpc 2020-10-27 16:01:56 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196