Bug 1841832
| Summary: | EO operator can't talk to itself after 4.4 -> 4.5 upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | IgorKarpukhin <ikarpukh> |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | aos-bugs, jcantril |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | needsqa | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:01:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1843715 | | |
Description
IgorKarpukhin
2020-05-29 15:25:00 UTC
EO LOGS:

time="2020-05-29T13:28:07Z" level=warning msg="Unable to perform synchronized flush: Post https://elasticsearch.openshift-logging.svc:9200/_flush/synced: dial tcp 172.30.212.255:9200: i/o timeout"
time="2020-05-29T13:29:43Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:29:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-1: Node elasticsearch-cdm-tudtgwxx-1 has not rejoined cluster elasticsearch yet"
time="2020-05-29T13:30:13Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-2: / [yellow green]"
time="2020-05-29T13:30:13Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-2: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:30:43Z" level=info msg="Waiting for cluster to be recovered before upgrading elasticsearch-cdm-tudtgwxx-3: / [yellow green]"
time="2020-05-29T13:30:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-tudtgwxx-3: Cluster not in at least yellow state before beginning upgrade: "
time="2020-05-29T13:33:45Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:36:47Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:39:49Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:42:51Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:45:54Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:48:56Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"
time="2020-05-29T13:51:58Z" level=info msg="Timed out waiting for elasticsearch-cdm-tudtgwxx-1 to rejoin cluster"

OPENSHIFT-LOGGING NAMESPACE STATUS:

cluster-logging-operator-7d7bbc88f5-vvv75       1/1   Running     0   72m
curator-1590759000-7jkdq                        1/1   Running     0   56m
curator-1590760200-9pfq2                        1/1   Running     0   36m
curator-1590762000-fm7t7                        0/1   Completed   0   6m34s
elasticsearch-cdm-tudtgwxx-1-85d69b58fc-cc5dg   1/2   Running     0   3m5s
elasticsearch-cdm-tudtgwxx-2-b888c4d44-djvnf    1/2   Running     0   2m22s
elasticsearch-cdm-tudtgwxx-3-69db67bd47-8hdpt   1/2   Running     0   34s
fluentd-7mblm                                   1/1   Running     0   71m
fluentd-82vkv                                   1/1   Running     0   71m
fluentd-qvkkd                                   1/1   Running     0   71m
fluentd-smhcc                                   1/1   Running     0   71m
fluentd-tfjqj                                   1/1   Running     0   71m
fluentd-zcwl7                                   1/1   Running     0   71m
kibana-7676965bcf-dn65g                         2/2   Running     0   71m

After I deleted all the ES pods, the new ES pods became Running. What is the root cause?

cluster-logging-operator-98f5c5fd-4pw4x         1/1   Running     0   20m
curator-1591153800-zb8lb                        0/1   Completed   0   66m
curator-1591157400-6gdbs                        0/1   Error       0   6m45s
elasticsearch-cdm-knaloezd-1-7bbdf76f85-z2pqv   2/2   Running     0   22m
elasticsearch-cdm-knaloezd-2-5d9d75f6fb-qs7gc   2/2   Running     0   22m
elasticsearch-cdm-knaloezd-3-86dd567d7-b2vlz    2/2   Running     0   22m
elasticsearch-delete-app-1591157700-4gmcl       0/1   Completed   0   112s
elasticsearch-delete-audit-1591157700-tm2lh     0/1   Completed   0   112s
elasticsearch-delete-infra-1591157700-jlhmg     0/1   Completed   0   112s
elasticsearch-rollover-app-1591157700-5l9rh     0/1   Error       0   112s
elasticsearch-rollover-audit-1591157700-sm7rm   0/1   Error       0   112s
elasticsearch-rollover-infra-1591157700-78mpt   0/1   Error       0   112s

This NetworkPolicy works for me:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restricted-es-policy
  namespace: openshift-logging
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          openshift.io/cluster-logging: "true"
      podSelector:
        matchLabels:
          name: elasticsearch-operator
    - podSelector:
        matchLabels:
          component: elasticsearch
  podSelector:
    matchLabels:
      component: elasticsearch
  policyTypes:
  - Ingress
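To check whether a policy like this actually unblocks the operator, steps along the following lines could be used. This is only a sketch: the openshift-operators-redhat namespace for the operator is an assumption, since this report never names where the EO runs.

# Illustrative verification steps; adjust names and namespaces to the actual cluster.
$ oc apply -f restricted-es-policy.yaml
$ oc -n openshift-logging get networkpolicy restricted-es-policy -o yaml
# Watch the EO logs and confirm the "dial tcp ... i/o timeout" warnings stop:
$ oc -n openshift-operators-redhat logs deploy/elasticsearch-operator -f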
The EO traffic is still blocked.

image: registry.svc.ci.openshift.org/origin/4.6:elasticsearch-operator
$ oc logs elasticsearch-operator-8658774bd6-7rrgn -f
{"level":"info","ts":1591267087.5367284,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"proxyconfig-controller","worker count":1}
time="2020-06-04T10:38:18Z" level=warning msg="when trying to perform full cluster restart: Unable to set shard allocation to primaries: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: net/http: TLS handshake timeout"
time="2020-06-04T10:40:18Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:42:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
time="2020-06-04T10:44:19Z" level=warning msg="when trying to get LowestClusterVersion: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all: dial tcp 172.30.82.30:9200: i/o timeout"
$ oc get networkpolicy restricted-es-policy -o json |jq '.spec'
{
  "ingress": [
    {
      "from": [
        {
          "namespaceSelector": {},
          "podSelector": {
            "matchLabels": {
              "name": "elasticsearch-operator"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9200,
          "protocol": "TCP"
        }
      ]
    },
    {
      "from": [
        {
          "podSelector": {
            "matchLabels": {
              "component": "elasticsearch"
            }
          }
        }
      ],
      "ports": [
        {
          "port": 9300,
          "protocol": "TCP"
        }
      ]
    }
  ],
  "podSelector": {
    "matchLabels": {
      "component": "elasticsearch"
    }
  },
  "policyTypes": [
    "Ingress"
  ]
}
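Compared with the first attempt, this spec admits the operator by pod label alone (an empty namespaceSelector combined with name=elasticsearch-operator) instead of requiring the openshift.io/cluster-logging namespace label, and it restricts ingress to ports 9200 and 9300. A quick, illustrative way (not taken from this report) to confirm those selectors match the running pods:

# Illustrative label checks for the selectors used above:
$ oc -n openshift-logging get pods -l component=elasticsearch --show-labels
$ oc get pods --all-namespaces -l name=elasticsearch-operator --show-labels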
After I applied the CR in https://bugzilla.redhat.com/show_bug.cgi?id=1841832#c3, it may work, but not always; it seems the NetworkPolicy is not reliably enforced after the policy is changed.

Verified: the EO and CLO can be upgraded, and after the upgrade everything works well.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196