Reproduced in a 4.2 ipi-on-aws cluster. The interesting thing is that Kibana 4.2.36 works well on 4.3, 4.5 and 4.6, and on a cluster that is upgrading from 4.2 to 4.3. It seems something is wrong between OAuth and Kibana.
[2020-07-07T10:04:57,571][ERROR][i.f.e.p.OpenshiftAPIService] Error retrieving username from token
java.net.SocketException: Broken pipe (Write failed)
..............
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
[2020-07-07T10:04:57,584][ERROR][i.f.e.p.OpenshiftRequestContextFactory] Error trying to fetch user's context from the cache
com.google.common.util.concurrent.UncheckedExecutionException: ElasticsearchException[java.net.SocketException: Broken pipe (Write failed)]; nested: SocketException[Broken pipe (Write failed)];
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2050) ~[guava-25.1-jre.jar:?]
	at com.google.common.cache.LocalCache.get(LocalCache.java:3951) ~[guava-25.1-jre.jar:?]
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3973) ~[guava-25.1-jre.jar:?]
AFAIK in the logs provided by the logging-dump, the connection to the API server seems to be broken for the same reason as in [1]. The root cause is in OpenJDK, which got bumped to 1.8.0_252. I can't see any temporary workaround here, except expediting a backport to 4.3.

com.google.common.util.concurrent.UncheckedExecutionException: ElasticsearchException[okhttp3.internal.http2.StreamResetException: stream was reset: PROTOCOL_ERROR]; nested: StreamResetException[stream was reset: PROTOCOL_ERROR];

podman run -i -t registry.redhat.io/openshift4/ose-logging-elasticsearch5@sha256:f59dc9bf080e5dec74ab4ea2a9cdea601b6f64acff4dc955f0d4c21b03fd7cb1 java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

@jeff I am assigning this BZ to you as you are taking care of [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1835396
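For anyone who wants to confirm whether a running cluster is on the affected JDK, a quick check (a sketch; it assumes the default openshift-logging namespace and the component=elasticsearch label used by cluster-logging):

  # Print the JDK version inside a running Elasticsearch container.
  pod=$(oc -n openshift-logging get pod -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}')
  oc -n openshift-logging exec -c elasticsearch $pod -- java -version
  # Affected images report 1.8.0_252; the older images report 1.8.0_242.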
The backport issue for 4.3.z is awaiting cherry-pick [1]. This seems to be the same issue that results from the automatic updates of the JDK in our images by ART, I believe. A rollback to a previous 4.2.z version with older images may help.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1854997
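To look for older 4.2.z tags to roll back to, something like the following could work (a sketch; it assumes skopeo is installed and you are logged in to registry.redhat.io):

  # List the available tags of the logging Elasticsearch image and filter for 4.2.z.
  skopeo list-tags docker://registry.redhat.io/openshift4/ose-logging-elasticsearch5 | grep 'v4\.2\.'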
After some investigation, the last possible version to switch back to is 4.2.29:

❯ podman run -i -t registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.2.29-202004140532 java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

How to roll back is something I still need to figure out, because you also need to roll back the elasticsearch-operator to 4.2.29.
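Before switching anything, it may be worth recording what the Elasticsearch CR currently points at (a sketch; the CR name `elasticsearch` and the namespace are the defaults created by cluster-logging):

  # Show the image currently configured on the Elasticsearch custom resource.
  oc -n openshift-logging get elasticsearch elasticsearch -o jsonpath='{.spec.nodeSpec.image}'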
@Periklis worked up the following instructions to roll back the cluster image. Disclaimer: it requires manual intervention and puts the cluster-logging stack to *Unmanaged*.

1. Switch the cluster-logging instance to unmanaged:

   oc -n openshift-logging edit clusterlogging instance

   Change field spec.managementState to:

   managementState: Unmanaged

2. Perform a shard synced flush on Elasticsearch to ensure there are no pending operations waiting to be written to disk prior to shutting down:

   oc exec -c elasticsearch $pod -- es_util --query=_flush/synced

3. Prevent shard balancing when purposely bringing down nodes, using the OpenShift Container Platform es_util tool:

   oc exec -c elasticsearch $pod -- es_util --query=_cluster/settings -XPUT -d '{ "transient": { "cluster.routing.allocation.enable" : "none" } }'

4. Edit the elasticsearch custom resource instance to change the image:

   oc -n openshift-logging edit elasticsearch elasticsearch

   Change image under spec.nodeSpec to:

   image: registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.2.29-202004140532

5. Wait until all pods are restarted:

   oc -n openshift-logging get pod -l component=elasticsearch -w

   Example output after restart:

   NAME                                            READY   STATUS    RESTARTS   AGE
   elasticsearch-cdm-h5bfms9n-1-5bc945c588-8xqwv   2/2     Running   0          9m59s
   elasticsearch-cdm-h5bfms9n-2-564f756d49-4dsgb   2/2     Running   0          15m
   elasticsearch-cdm-h5bfms9n-3-6b5bbd8c75-55s4r   2/2     Running   0          15m

6. Check that the Elasticsearch cluster is in green state (make sure the status field is green in the response):

   oc exec <any_es_pod_in_the_cluster> -c elasticsearch -- health

7. Once all the deployments for the cluster have been rolled out, re-enable shard balancing:

   oc exec -c elasticsearch $pod -- es_util --query=_cluster/settings -XPUT -d '{ "transient": { "cluster.routing.allocation.enable" : "all" } }'

**NOTE:**
* I deployed a 4.2 logging cluster from OLM and was successfully able to follow these procedures to switch the image.
* I am unable to confirm this will resolve the issue since this problem does not present itself in the Openshift clusters available to me.
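For convenience, the same sequence can be strung together as a script. This is only a sketch: it assumes the default openshift-logging namespace, the CR names "instance" and "elasticsearch" used above, and it substitutes oc patch for the manual oc edit steps, which is my own shortcut and not part of the original instructions.

  #!/usr/bin/env bash
  set -euo pipefail

  NS=openshift-logging
  IMAGE=registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.2.29-202004140532

  # Helper: name of any currently running Elasticsearch pod.
  espod() {
    oc -n "$NS" get pod -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}'
  }

  # 1. Stop the cluster-logging operator from reconciling.
  oc -n "$NS" patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

  # 2. Synced flush so nothing is left pending before the restarts.
  oc -n "$NS" exec -c elasticsearch "$(espod)" -- es_util --query=_flush/synced

  # 3. Disable shard balancing while nodes go down.
  oc -n "$NS" exec -c elasticsearch "$(espod)" -- es_util --query=_cluster/settings -XPUT \
    -d '{ "transient": { "cluster.routing.allocation.enable" : "none" } }'

  # 4. Point the Elasticsearch CR at the older image.
  oc -n "$NS" patch elasticsearch elasticsearch --type merge \
    -p "{\"spec\":{\"nodeSpec\":{\"image\":\"$IMAGE\"}}}"

  # 5. Wait for the pods to be replaced (watch them in another terminal, e.g. oc get pod -w).
  read -r -p "Press Enter once all elasticsearch pods are 2/2 Running again... "

  # 6. Confirm the cluster is green, then 7. re-enable shard balancing.
  oc -n "$NS" exec -c elasticsearch "$(espod)" -- health
  oc -n "$NS" exec -c elasticsearch "$(espod)" -- es_util --query=_cluster/settings -XPUT \
    -d '{ "transient": { "cluster.routing.allocation.enable" : "all" } }'

Note that the pod name is re-resolved before each command, since the original pods are replaced after step 4.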
The comment 16 workaround works. I'd like to make two tiny changes (set the elasticsearch CR to Unmanaged/Managed at step 2 and step 5; a command sketch for those two steps follows the list):

1. Switch the cluster-logging instance to unmanaged.
2. Switch the elasticsearch custom resource to unmanaged.
3. Perform a shard synced flush on Elasticsearch to ensure there are no pending operations waiting to be written to disk prior to shutting down.
4. Prevent shard balancing when purposely bringing down nodes, using the OpenShift Container Platform es_util tool.
5. Edit the elasticsearch custom resource instance to ose-logging-elasticsearch5:v4.2.29-202004140532, and switch the elasticsearch custom resource back to Managed.
6. Wait until all pods are restarted.
7. Check that the Elasticsearch cluster is in the green state (make sure the status field is green in the response).
8. Once all the deployments for the cluster have been rolled out, re-enable shard balancing.
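A possible non-interactive way to do steps 2 and 5 (an assumption on my side: that the Elasticsearch CR exposes spec.managementState the same way the ClusterLogging CR does; fall back to oc edit if it does not):

  # Step 2: stop the elasticsearch-operator from reconciling the Elasticsearch CR.
  oc -n openshift-logging patch elasticsearch elasticsearch --type merge \
    -p '{"spec":{"managementState":"Unmanaged"}}'

  # Step 5 (second half): hand control back to the operator after changing the image.
  oc -n openshift-logging patch elasticsearch elasticsearch --type merge \
    -p '{"spec":{"managementState":"Managed"}}'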
@Catherine_H Let me know of any news from the customer side. (Putting this on needinfo.)