Bug 1918920

Summary: The elasticsearch pods don't get restarted automatically after each update
Product: OpenShift Container Platform
Reporter: KOSAL RAJ I <kiyyappa>
Component: Logging
Assignee: Gerard Vanloo <gvanloo>
Status: CLOSED NEXTRELEASE
QA Contact: Anping Li <anli>
Severity: high
Priority: high
Version: 4.5
CC: afield, anli, aos-bugs, apjagtap, bleanhar, gvanloo, jkandasa, kearls, mrobson, msweiker, qitang
Target Release: 4.8.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-exploration
Doc Type: Bug Fix
Doc Text:
Cause: The elasticsearch pods were not marked to be restarted after their secret changed.
Consequence: The elasticsearch pods were not restarted automatically after each update.
Fix: A new controller watches the elasticsearch secret. If the secret changes, the controller sets the elasticsearch cluster's status to "scheduledRedeploy".
Result: The elasticsearch cluster is restarted correctly after its secret is changed.
Clone Of:
: 1923788 1952968
Last Closed: 2021-07-23 14:50:24 UTC
Type: Bug

Description KOSAL RAJ I 2021-01-21 17:20:18 UTC
Description of problem:
The elasticsearch pods don't get restarted automatically after each update.

Version-Release number of selected component (if applicable):
OCP 4.5

How reproducible:
Reproducible by enabling automatic updates for the logging cluster.

Steps to Reproduce:
1.
2.
3.

Actual results:
ES log:
2021/01/04 08:48:25 http: TLS handshake error from 10.128.0.29:54900: tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "openshift-cluster-logging-signer")
2021/01/04 08:48:27 http: TLS handshake error from 10.128.4.21:49714: remote error: tls: unknown certificate authority
2021/01/04 08:48:41 http: TLS handshake error from 10.128.0.29:55360: tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "openshift-cluster-logging-signer")
2021/01/04 08:48:46 http: TLS handshake error from 10.128.0.29:55518: tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "openshift-cluster-logging-signer")
2021/01/04 08:48:48 http: TLS handshake error from 10.128.0.29:55600: tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "openshift-cluster-logging-signer")
time="2021-01-04T08:48:51Z" level=info msg="Handling request \"authorization\""
time="2021-01-04T08:48:52Z" level=info msg="Handling request \"authorization\""

Expected results:
ES pods should get updated.

Additional info:

A similar issue is described in the KCS article: https://access.redhat.com/solutions/5347071

Manually restarting the pods fixes the issue.

Comment 1 Jeff Cantrill 2021-01-25 17:12:00 UTC
Eric, do you think this would be fixed by https://github.com/openshift/cluster-logging-operator/pull/858, which fixes up the cert storage and generation? It smells like the same issue.

Comment 2 ewolinet 2021-01-25 18:09:15 UTC
It seems to be based on the same issue, yes.

We have a PR in the works for the Elasticsearch Operator (EO) that should help the operator recognize when this happens and schedule all ES pods to be restarted.
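
For readers following along, the idea described in the Doc Text and in this comment (notice that the secret changed, then flag every ES node for restart) can be sketched roughly as below. This is a minimal illustration, not the operator's actual code: the NodeStatus type, the function names, the "ca-bundle" key and the hash-comparison approach are all assumptions made for the example. Only the "scheduledRedeploy" status name comes from this bug.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// NodeStatus is a hypothetical, simplified stand-in for the per-node status
// an operator might track. ScheduledRedeploy mirrors the "scheduledRedeploy"
// status named in the Doc Text; everything else here is illustrative.
type NodeStatus struct {
	DeploymentName    string
	ScheduledRedeploy bool
}

// hashSecretData returns a deterministic SHA-256 digest of a secret's data.
// Keys are sorted so that Go's random map iteration order cannot change the
// result, and lengths are written so that key/value boundaries are unambiguous.
func hashSecretData(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%d:%s=%d:", len(k), k, len(data[k]))
		h.Write(data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// scheduleRedeployIfChanged compares the hash of the current secret data with
// the last observed hash. If they differ, every node is flagged for redeploy.
// It returns the hash to remember and whether a change was detected.
func scheduleRedeployIfChanged(lastHash string, data map[string][]byte, nodes []NodeStatus) (string, bool) {
	current := hashSecretData(data)
	if current == lastHash {
		return lastHash, false
	}
	for i := range nodes {
		nodes[i].ScheduledRedeploy = true
	}
	return current, true
}

func main() {
	// Placeholder deployment names and secret contents, not real cluster data.
	nodes := []NodeStatus{{DeploymentName: "es-node-1"}, {DeploymentName: "es-node-2"}}
	lastHash := hashSecretData(map[string][]byte{"ca-bundle": []byte("old CA")})

	// Simulated certificate rotation: same key, different bytes.
	rotated := map[string][]byte{"ca-bundle": []byte("new CA")}
	lastHash, changed := scheduleRedeployIfChanged(lastHash, rotated, nodes)

	fmt.Println("secret changed:", changed, "hash prefix:", lastHash[:8])
	for _, n := range nodes {
		fmt.Println(n.DeploymentName, "scheduledRedeploy:", n.ScheduledRedeploy)
	}
}
```

Comparing a content hash rather than a timestamp or resource version means that rewriting the secret with identical bytes would not trigger a restart. Whether the real fix makes the same choice is not stated in this bug.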

Comment 13 Matthew Sweikert 2021-02-24 20:50:33 UTC
Hello,

As mentioned earlier, this ticket is currently blocking https://issues.redhat.com/browse/TRACING-1725.  I moved this issue to urgent as this will now push the customer into an unsupported state, sitting on OCP 4.4.

Comment 17 Hui Kang 2021-03-12 19:27:42 UTC
Two Jira issues:

- 5.1: https://issues.redhat.com/browse/LOG-1205
- 5.0: https://issues.redhat.com/browse/LOG-1206

Comment 22 Gerard Vanloo 2021-07-23 14:50:24 UTC
Closed. This was moved to JIRA: https://issues.redhat.com/browse/LOG-1619