Description of problem:

The EO can't access the elasticsearch service:

time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"

Version-Release number of selected component (if applicable):
4.5

How reproducible:
Upgrade 4.4 CLO and EO with CLO's instance created.

Steps to Reproduce:
1. Install 4.4 CLO and EO
2. Create CLO's CR
3. Upgrade EO to 4.5

Actual results:
Elasticsearch cluster can't start

Expected results:
Elasticsearch cluster upgraded, up and running

Additional info:
EO LOGS:

time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"
time="2020-05-29T13:29:43Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:29:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-1: Node elasticsearch-cdm-tudtgwxx-1 has not rejoined cluster elasticsearch yet"
time="2020-05-29T13:30:13Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-2: / [yellow green]"
time="2020-05-29T13:30:13Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-2: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:30:43Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-3: / [yellow green]"
time="2020-05-29T13:30:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-3: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:33:45Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:36:47Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:39:49Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:42:51Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:45:54Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:48:56Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:51:58Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"

OPENSHIFT-LOGGING NAMESPACE STATUS:
cluster-logging-operator-7d7bbc88f5-vvv75       1/1   Running     0   72m
curator-1590759000-7jkdq                        1/1   Running     0   56m
curator-1590760200-9pfq2                        1/1   Running     0   36m
curator-1590762000-fm7t7                        0/1   Completed   0   6m34s
elasticsearch-cdm-tudtgwxx-1-85d69b58fc-cc5dg   1/2   Running     0   3m5s
elasticsearch-cdm-tudtgwxx-2-b888c4d44-djvnf    1/2   Running     0   2m22s
elasticsearch-cdm-tudtgwxx-3-69db67bd47-8hdpt   1/2   Running     0   34s
fluentd-7mblm                                   1/1   Running     0   71m
fluentd-82vkv                                   1/1   Running     0   71m
fluentd-qvkkd                                   1/1   Running     0   71m
fluentd-smhcc                                   1/1   Running     0   71m
fluentd-tfjqj                                   1/1   Running     0   71m
fluentd-zcwl7                                   1/1   Running     0   71m
kibana-7676965bcf-dn65g                         2/2   Running     0   71m
After I deleted all ES pods, the new ES pods became running. What is the root cause?

cluster-logging-operator-98f5c5fd-4pw4x         1/1   Running     0   20m
curator-1591153800-zb8lb                        0/1   Completed   0   66m
curator-1591157400-6gdbs                        0/1   Error       0   6m45s
elasticsearch-cdm-knaloezd-1-7bbdf76f85-z2pqv   2/2   Running     0   22m
elasticsearch-cdm-knaloezd-2-5d9d75f6fb-qs7gc   2/2   Running     0   22m
elasticsearch-cdm-knaloezd-3-86dd567d7-b2vlz    2/2   Running     0   22m
elasticsearch-delete-app-1591157700-4gmcl       0/1   Completed   0   112s
elasticsearch-delete-audit-1591157700-tm2lh     0/1   Completed   0   112s
elasticsearch-delete-infra-1591157700-jlhmg     0/1   Completed   0   112s
elasticsearch-rollover-app-1591157700-5l9rh     0/1   Error       0   112s
elasticsearch-rollover-audit-1591157700-sm7rm   0/1   Error       0   112s
elasticsearch-rollover-infra-1591157700-78mpt   0/1   Error       0   112s
This network policy works for me.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restricted-es-policy
  namespace: openshift-logging
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          openshift.io/cluster-logging: "true"
      podSelector:
        matchLabels:
          name: elasticsearch-operator
    - podSelector:
        matchLabels:
          component: elasticsearch
  podSelector:
    matchLabels:
      component: elasticsearch
  policyTypes:
  - Ingress
The EO traffic is still blocked.

image: registry.svc.ci.openshift.org/origin/4.6:elasticsearch-operator

$ oc logs elasticsearch-operator-8658774bd6-7rrgn -f
{"level":"info","ts":1591267087.5367284,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"proxyconfig-controller","worker count":1}
time="2020-06-04T10:38:18Z" level=warning msg="when trying to perform full cluster restart: Unable to set shard allocation to primaries: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: net/http: TLS handshake timeout"
time="2020-06-04T10:40:18Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:42:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:44:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"

$ oc get networkpolicy restricted-es-policy -o json | jq '.spec'
{
  "ingress": [
    {
      "from": [
        {
          "namespaceSelector": {},
          "podSelector": {
            "matchLabels": {
              "name": "elasticsearch-operator"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9300,
          "protocol": "TCP"
        }
      ]
    }
  ],
  "podSelector": {
    "matchLabels": {
      "component": "elasticsearch"
    }
  },
  "policyTypes": [
    "Ingress"
  ]
}
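For reference, the spec shown in the `oc get ... -o json` output above can be written as a NetworkPolicy manifest. This is a direct translation of that JSON, not an additional change; note that the empty `namespaceSelector: {}` matches all namespaces, so the first rule admits the `elasticsearch-operator` pod regardless of which namespace it runs in, while the separate 9300/TCP rule covers inter-node transport traffic between ES pods.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restricted-es-policy
  namespace: openshift-logging
spec:
  podSelector:
    matchLabels:
      component: elasticsearch
  policyTypes:
  - Ingress
  ingress:
  # EO -> ES REST API (9200), from any namespace
  - from:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          name: elasticsearch-operator
    ports:
    - port: 9200
      protocol: TCP
  # ES pod -> ES pod REST API (9200), same namespace
  - from:
    - podSelector:
        matchLabels:
          component: elasticsearch
    ports:
    - port: 9200
      protocol: TCP
  # ES pod -> ES pod transport (9300), same namespace
  - from:
    - podSelector:
        matchLabels:
          component: elasticsearch
    ports:
    - port: 9300
      protocol: TCP
```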
After I applied the CR in https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3, it works, though not always — it seems the NetworkPolicy isn't enforced reliably after the policy is changed.
Verified: the EO and CLO can be upgraded, and everything works well after the upgrade.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196