Bug 1854304 - All users cannot view the log through kibana
Summary: All users cannot view the log through kibana
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.2.z
Assignee: Periklis Tsirakidis
QA Contact: Anping Li
URL:
Whiteboard:
Depends On: 1854997
Blocks:
 
Reported: 2020-07-07 07:07 UTC by Catherine_H
Modified: 2023-10-06 20:59 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-10 20:02:26 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Knowledge Base (Solution) 5697171 (Last Updated: 2021-01-12 06:28:10 UTC)

Comment 4 Anping Li 2020-07-07 09:45:19 UTC
Reproduced in a 4.2 IPI-on-AWS cluster. Interestingly, Kibana 4.2.36 works well on 4.3, 4.5, and 4.6, as well as on a cluster that is upgrading from 4.2 to 4.3. It seems something is wrong between OAuth and Kibana.

Comment 5 Anping Li 2020-07-07 10:24:16 UTC
[2020-07-07T10:04:57,571][ERROR][i.f.e.p.OpenshiftAPIService] Error retrieving username from token
java.net.SocketException: Broken pipe (Write failed)
        ..............	
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
[2020-07-07T10:04:57,584][ERROR][i.f.e.p.OpenshiftRequestContextFactory] Error trying to fetch user's context from the cache
com.google.common.util.concurrent.UncheckedExecutionException: ElasticsearchException[java.net.SocketException: Broken pipe (Write failed)]; nested: SocketException[Broken pipe (Write failed)];
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2050) ~[guava-25.1-jre.jar:?]
	at com.google.common.cache.LocalCache.get(LocalCache.java:3951) ~[guava-25.1-jre.jar:?]
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3973) ~[guava-25.1-jre.jar:?]
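
For reference, the user lookup that fails here (the openshift-elasticsearch-plugin resolving the bearer token against the OpenShift API) can be approximated from a workstation; this is only a sketch, assuming you are logged in as one of the affected users:

# Sketch: repeat the plugin's username lookup outside the pod. If this succeeds,
# the token and the API server are fine and the failure is inside the
# Elasticsearch container itself.
TOKEN=$(oc whoami -t)
oc get --raw /apis/user.openshift.io/v1/users/~ --token="$TOKEN"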

Comment 9 Periklis Tsirakidis 2020-07-08 14:24:18 UTC
AFAIK, in the logs provided by the logging-dump, the connection to the API server appears to be broken for the same reason as in [1]. The root cause is OpenJDK, which got bumped to 1.8.0_252. I can't see any temporary workaround here other than expediting a backport to 4.3.

com.google.common.util.concurrent.UncheckedExecutionException: ElasticsearchException[okhttp3.internal.http2.StreamResetException: stream was reset: PROTOCOL_ERROR]; nested: StreamResetException[stream was reset: PROTOCOL_ERROR];

podman run -i -t registry.redhat.io/openshift4/ose-logging-elasticsearch5@sha256:f59dc9bf080e5dec74ab4ea2a9cdea601b6f64acff4dc955f0d4c21b03fd7cb1 java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


@jeff I am assigning this BZ to you as you are taking care of [1].


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1835396
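
As a quick check, the JDK actually running in the deployed Elasticsearch pods can be confirmed in place; a sketch, assuming the component=elasticsearch label used below in comment 16:

# Sketch: print the JDK version inside a running Elasticsearch pod
pod=$(oc -n openshift-logging get pod -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-logging exec -c elasticsearch $pod -- java -version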

Comment 11 Periklis Tsirakidis 2020-07-09 07:14:25 UTC
The backport issue for 4.3.z is awaiting cherry-pick [1]. I believe this is the same issue, resulting from the automatic JDK updates that ART applies to our images. Rolling back to a previous 4.2.z version with older images may help.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1854997
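
Before rolling back, the image currently set on the Elasticsearch custom resource (and therefore the JDK in use) can be checked with something like the following sketch (the field path matches the CR edited in comment 16):

# Sketch: show the image currently configured on the Elasticsearch CR
oc -n openshift-logging get elasticsearch elasticsearch -o jsonpath='{.spec.nodeSpec.image}'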

Comment 13 Periklis Tsirakidis 2020-07-09 10:51:12 UTC
After some investigation, the last possible version to switch back is 4.2.29:

❯ podman run -i -t registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.2.29-202004140532 java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)


How to roll back is something I still need to figure out, because the elasticsearch-operator also needs to be rolled back to 4.2.29.
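
A rough way to see which elasticsearch-operator version is currently installed (a sketch; this assumes an OLM-based install, and the namespace can differ between clusters):

# Sketch: list the installed elasticsearch-operator CSV across all namespaces
oc get clusterserviceversion --all-namespaces | grep elasticsearch-operator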

Comment 16 Jeff Cantrill 2020-07-09 18:52:09 UTC
@Periklis worked up the following instructions to roll back the cluster image (non-interactive equivalents for steps 1 and 4 are sketched after the note at the end of this comment).
Disclaimer: It requires manual intervention and puts the cluster-logging stack into the *Unmanaged* state.
1. Switch the cluster-logging instance to unmanaged
oc -n openshift-logging edit clusterlogging instance
Change field spec.managementState to:
  managementState: Unmanaged

2. Perform a shard synced flush on Elasticsearch to ensure there are no pending operations waiting to be written to disk prior to shutting down:
oc exec -c elasticsearch $pod -- es_util --query=_flush/synced

3. Prevent shard balancing when purposely bringing down nodes using the OpenShift Container Platform es_util tool:
oc exec -c elasticsearch $pod -- es_util --query=_cluster/settings -XPUT -d '{ "transient": { "cluster.routing.allocation.enable" : "none" } }'

4. Edit the elasticsearch custom resource instance to change the image:
oc -n openshift-logging edit elasticsearch elasticsearch
Change image under spec.nodeSpec to:
  image: registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.2.29-202004140532

5. Wait until all pods are restarted.
oc -n openshift-logging get pod -l component=elasticsearch -w
example output after restart:
NAME                                            READY   STATUS    RESTARTS   AGE
elasticsearch-cdm-h5bfms9n-1-5bc945c588-8xqwv   2/2     Running   0          9m59s
elasticsearch-cdm-h5bfms9n-2-564f756d49-4dsgb   2/2     Running   0          15m
elasticsearch-cdm-h5bfms9n-3-6b5bbd8c75-55s4r   2/2     Running   0          15m

6. Check that the Elasticsearch cluster is in a green state (make sure the status field is green in the response):
oc exec <any_es_pod_in_the_cluster> -c elasticsearch -- health

7. Once all the deployments for the cluster have been rolled out, re-enable shard balancing:
oc exec -c elasticsearch $pod -- es_util --query=_cluster/settings -XPUT -d '{ "transient": { "cluster.routing.allocation.enable" : "all" } }'


**NOTE:**
* I deployed a 4.2 logging cluster from OLM and was successfully able to follow these procedures to switch the image
* I am unable to confirm this will resolve the issue, since the problem does not present itself in the OpenShift clusters available to me.
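
For reference, the interactive edits in steps 1 and 4 can also be applied non-interactively. This is only a sketch of equivalent oc patch calls against the same fields edited above; $pod in steps 2, 3 and 7 is any Elasticsearch pod:

# Sketch: pick one Elasticsearch pod for the es_util commands in steps 2, 3 and 7
pod=$(oc -n openshift-logging get pod -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}')

# Sketch: step 1 as a patch instead of an interactive edit
oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

# Sketch: step 4 as a patch instead of an interactive edit
oc -n openshift-logging patch elasticsearch elasticsearch --type merge -p '{"spec":{"nodeSpec":{"image":"registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.2.29-202004140532"}}}'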

Comment 17 Anping Li 2020-07-10 03:21:18 UTC
The comment 16 workaround works. I'd like to make small changes: set the Elasticsearch custom resource to Unmanaged at step 2 and back to Managed at step 5 (commands sketched after the list).
1. Switch the cluster-logging instance to unmanaged
2. Switch the elasticsearch custom resource to unmanaged.
3. Perform on Elasticsearch a shard synced flush to ensure there are no pending operations waiting to be written to disk prior to shutting down:
4. Prevent shard balancing when purposely bringing down nodes using the OpenShift Container Platform es_util tool:
5. Edit the elasticsearch custom resource instance to ose-logging-elasticsearch5:v4.2.29-202004140532. and switch the elasticsearch custom resource back to Managed.
6. Wait until all pods are restarted.
7. Check that the Elasticsearch cluster is in a green state (make sure the status field is green in the response).
8. Once all the deployments for the cluster have been rolled out, re-enable shard balancing:
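
The extra toggles in steps 2 and 5 can be done the same way; a sketch, assuming the Elasticsearch custom resource exposes spec.managementState like the ClusterLogging CR does:

# Sketch: step 2 - take the Elasticsearch CR out of operator management
oc -n openshift-logging patch elasticsearch elasticsearch --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

# Sketch: step 5 (second half) - hand the CR back to the operator after changing the image
oc -n openshift-logging patch elasticsearch elasticsearch --type merge -p '{"spec":{"managementState":"Managed"}}'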

Comment 18 Periklis Tsirakidis 2020-07-10 06:14:22 UTC
@Catherine_H

Let me know about any news from the customer side. (Setting the needinfo flag.)

