Reproduced in a 4.2 ipi-on-aws cluster. The interesting thing is that Kibana 4.2.36 works well on 4.3, 4.5 and 4.6, and on a cluster that is upgrading from 4.2 to 4.3. It seems something is wrong between OAuth and Kibana.
[2020-07-07T10:04:57,571][ERROR][i.f.e.p.OpenshiftAPIService] Error retrieving username from token
java.net.SocketException: Broken pipe (Write failed)
..............
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
[2020-07-07T10:04:57,584][ERROR][i.f.e.p.OpenshiftRequestContextFactory] Error trying to fetch user's context from the cache
com.google.common.util.concurrent.UncheckedExecutionException: ElasticsearchException[java.net.SocketException: Broken pipe (Write failed)]; nested: SocketException[Broken pipe (Write failed)];
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2050) ~[guava-25.1-jre.jar:?]
	at com.google.common.cache.LocalCache.get(LocalCache.java:3951) ~[guava-25.1-jre.jar:?]
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3973) ~[guava-25.1-jre.jar:?]
AFAIK in the logs provided by the logging-dump, the connection to the API server seems to be broken for the same reason as in [1]. The root cause is in OpenJDK, which got bumped to 1.8.0_252. I can't see any temporary workaround here, except expediting a backport to 4.3.

com.google.common.util.concurrent.UncheckedExecutionException: ElasticsearchException[okhttp3.internal.http2.StreamResetException: stream was reset: PROTOCOL_ERROR]; nested: StreamResetException[stream was reset: PROTOCOL_ERROR];

podman run -i -t registry.redhat.io/openshift4/ose-logging-elasticsearch5@sha256:f59dc9bf080e5dec74ab4ea2a9cdea601b6f64acff4dc955f0d4c21b03fd7cb1 java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

@jeff I am assigning this BZ to you as you are taking care of [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1835396
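For anyone who wants to confirm whether a running cluster is on the affected JDK, a quick check (a sketch; it assumes the default openshift-logging namespace and the component=elasticsearch label used by cluster-logging):

  # Print the JDK version inside a running Elasticsearch container.
  pod=$(oc -n openshift-logging get pod -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}')
  oc -n openshift-logging exec -c elasticsearch $pod -- java -version
  # Affected images report 1.8.0_252; the older images report 1.8.0_242.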
The backport issue for 4.3.z is awaiting cherry-pick [1]. This seems to be the same issue that results from the automatic updates of the JDK in our images by ART, I believe. A rollback to a previous 4.2.z version with older images may help.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1854997
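To look for older 4.2.z tags to roll back to, something like the following could work (a sketch; it assumes skopeo is installed and you are logged in to registry.redhat.io):

  # List the available tags of the logging Elasticsearch image and filter for 4.2.z.
  skopeo list-tags docker://registry.redhat.io/openshift4/ose-logging-elasticsearch5 | grep 'v4\.2\.'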
After some investigation, the last possible version to switch back to is 4.2.29:

❯ podman run -i -t registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.2.29-202004140532 java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

How to roll back is something I still need to figure out, because you also need to roll back the elasticsearch-operator to 4.2.29.
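Before switching anything, it may be worth recording what the Elasticsearch CR currently points at (a sketch; the CR name `elasticsearch` and the namespace are the defaults created by cluster-logging):

  # Show the image currently configured on the Elasticsearch custom resource.
  oc -n openshift-logging get elasticsearch elasticsearch -o jsonpath='{.spec.nodeSpec.image}'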
@Periklis worked up the following instructions to roll back the cluster image. Disclaimer: it requires manual intervention and puts the cluster-logging stack to *Unmanaged*.

1. Switch the cluster-logging instance to unmanaged:

   oc -n openshift-logging edit clusterlogging instance

   Change field spec.managementState to:

   managementState: Unmanaged

2. Perform a shard synced flush on Elasticsearch to ensure there are no pending operations waiting to be written to disk prior to shutting down:

   oc exec -c elasticsearch $pod -- es_util --query=_flush/synced

3. Prevent shard balancing when purposely bringing down nodes, using the OpenShift Container Platform es_util tool:

   oc exec -c elasticsearch $pod -- es_util --query=_cluster/settings -XPUT -d '{ "transient": { "cluster.routing.allocation.enable" : "none" } }'

4. Edit the elasticsearch custom resource instance to change the image:

   oc -n openshift-logging edit elasticsearch elasticsearch

   Change image under spec.nodeSpec to:

   image: registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.2.29-202004140532

5. Wait until all pods are restarted:

   oc -n openshift-logging get pod -l component=elasticsearch -w

   Example output after restart:

   NAME                                            READY   STATUS    RESTARTS   AGE
   elasticsearch-cdm-h5bfms9n-1-5bc945c588-8xqwv   2/2     Running   0          9m59s
   elasticsearch-cdm-h5bfms9n-2-564f756d49-4dsgb   2/2     Running   0          15m
   elasticsearch-cdm-h5bfms9n-3-6b5bbd8c75-55s4r   2/2     Running   0          15m

6. Check that the Elasticsearch cluster is in green state (make sure the status field is green in the response):

   oc exec <any_es_pod_in_the_cluster> -c elasticsearch -- health

7. Once all the deployments for the cluster have been rolled out, re-enable shard balancing:

   oc exec -c elasticsearch $pod -- es_util --query=_cluster/settings -XPUT -d '{ "transient": { "cluster.routing.allocation.enable" : "all" } }'

**NOTE:**
* I deployed a 4.2 logging cluster from OLM and was successfully able to follow these procedures to switch the image.
* I am unable to confirm this will resolve the issue since this problem does not present itself in the Openshift clusters available to me.
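For convenience, the same sequence can be strung together as a script. This is only a sketch: it assumes the default openshift-logging namespace, the CR names "instance" and "elasticsearch" used above, and it substitutes oc patch for the manual oc edit steps, which is my own shortcut and not part of the original instructions.

  #!/usr/bin/env bash
  set -euo pipefail

  NS=openshift-logging
  IMAGE=registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.2.29-202004140532

  # Helper: name of any currently running Elasticsearch pod.
  espod() {
    oc -n "$NS" get pod -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}'
  }

  # 1. Stop the cluster-logging operator from reconciling.
  oc -n "$NS" patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

  # 2. Synced flush so nothing is left pending before the restarts.
  oc -n "$NS" exec -c elasticsearch "$(espod)" -- es_util --query=_flush/synced

  # 3. Disable shard balancing while nodes go down.
  oc -n "$NS" exec -c elasticsearch "$(espod)" -- es_util --query=_cluster/settings -XPUT \
    -d '{ "transient": { "cluster.routing.allocation.enable" : "none" } }'

  # 4. Point the Elasticsearch CR at the older image.
  oc -n "$NS" patch elasticsearch elasticsearch --type merge \
    -p "{\"spec\":{\"nodeSpec\":{\"image\":\"$IMAGE\"}}}"

  # 5. Wait for the pods to be replaced (watch them in another terminal, e.g. oc get pod -w).
  read -r -p "Press Enter once all elasticsearch pods are 2/2 Running again... "

  # 6. Confirm the cluster is green, then 7. re-enable shard balancing.
  oc -n "$NS" exec -c elasticsearch "$(espod)" -- health
  oc -n "$NS" exec -c elasticsearch "$(espod)" -- es_util --query=_cluster/settings -XPUT \
    -d '{ "transient": { "cluster.routing.allocation.enable" : "all" } }'

Note that the pod name is re-resolved before each command, since the original pods are replaced after step 4.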
The comment 16 workaround works. I'd like to make two tiny changes (set the elasticsearch CR to Unmanaged/Managed at step 2 and step 5; a command sketch for those two steps follows the list):

1. Switch the cluster-logging instance to unmanaged.
2. Switch the elasticsearch custom resource to unmanaged.
3. Perform a shard synced flush on Elasticsearch to ensure there are no pending operations waiting to be written to disk prior to shutting down.
4. Prevent shard balancing when purposely bringing down nodes, using the OpenShift Container Platform es_util tool.
5. Edit the elasticsearch custom resource instance to ose-logging-elasticsearch5:v4.2.29-202004140532, and switch the elasticsearch custom resource back to Managed.
6. Wait until all pods are restarted.
7. Check that the Elasticsearch cluster is in the green state (make sure the status field is green in the response).
8. Once all the deployments for the cluster have been rolled out, re-enable shard balancing.
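A possible non-interactive way to do steps 2 and 5 (an assumption on my side: that the Elasticsearch CR exposes spec.managementState the same way the ClusterLogging CR does; fall back to oc edit if it does not):

  # Step 2: stop the elasticsearch-operator from reconciling the Elasticsearch CR.
  oc -n openshift-logging patch elasticsearch elasticsearch --type merge \
    -p '{"spec":{"managementState":"Unmanaged"}}'

  # Step 5 (second half): hand control back to the operator after changing the image.
  oc -n openshift-logging patch elasticsearch elasticsearch --type merge \
    -p '{"spec":{"managementState":"Managed"}}'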
@Catherine_H Let me know of any news from the customer side. (Putting this on needinfo.)