Bug 1885674

Summary: Init container for fluentd failing due to certs
Product: OpenShift Container Platform
Component: Logging
Version: 4.5
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Candace Sheremeta <cshereme>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Anping Li <anli>
CC: achakrat, anisal, anli, aos-bugs, cruhm, ewolinet, jcantril, jeder, kiyyappa, mrobson, periklis, puraut, rsandu
Keywords: ServiceDeliveryImpact
Flags: achakrat: needinfo?
Whiteboard: osd-45-logging, logging-core
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-02-24 11:21:18 UTC

Description Candace Sheremeta 2020-10-06 16:54:09 UTC
Service Delivery is seeing fluentd pods unable to start due to failures with the fluentd-init container. At first we saw this error in just one fluentd pod on the cluster (the other fluentd pods were Running):

$ oc logs fluentd-g4rgk -c fluentd-init 
/opt/rh/rh-ruby25/root/usr/share/ruby/net/protocol.rb:44:in `connect_nonblock': SSL_connect returned=1 errno=0 state=error: certificate verify failed (unable to get local issuer certificate) (OpenSSL::SSL::SSLError)
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/protocol.rb:44:in `ssl_socket_connect'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:985:in `connect'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:920:in `do_start'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:909:in `start'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:609:in `start'
	from wait_for_es_version.rb:26:in `<main>'
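
One way to check for the suspected trust mismatch is to compare the CA that fluentd trusts with the CA Elasticsearch was issued from. A minimal sketch, assuming the default `openshift-logging` secret names and the key names `ca-bundle.crt` (fluentd) and `admin-ca` (elasticsearch), which may differ on a given cluster:

    # Assumed secret/key names; on a healthy cluster the two fingerprints should match
    oc -n openshift-logging get secret fluentd -o jsonpath='{.data.ca-bundle\.crt}' | base64 -d | openssl x509 -noout -fingerprint
    oc -n openshift-logging get secret elasticsearch -o jsonpath='{.data.admin-ca}' | base64 -d | openssl x509 -noout -fingerprint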

There have been a number of theories presented about the cause or resolution of this issue, including:

- the cluster logging instance having been created in a namespace other than `openshift-logging`
- bouncing the operator pod to resolve the issue
- deleting the `master-certs` secret to resolve the issue

However, none of those seem to apply here. After we deleted the `master-certs` secret and bounced the fluentd pods, we are now seeing the following behavior in ALL fluentd pods on the cluster:

- fluentd pods go from Init -> InitError -> CrashLoopBackOff and then back to Init
- all fluentd pods show these errors:

$ oc logs -n openshift-logging fluentd-pdm47 -c fluentd-init
/opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:937:in `initialize': execution expired (Net::OpenTimeout)
        from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:937:in `open'
        from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:937:in `block in connect'
        from /opt/rh/rh-ruby25/root/usr/share/ruby/timeout.rb:103:in `timeout'
        from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:935:in `connect'
        from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:920:in `do_start'
        from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:909:in `start'
        from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:609:in `start'
        from wait_for_es_version.rb:26:in `<main>'
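
The `Net::OpenTimeout` above means the init container could not reach Elasticsearch at all, so it is also worth confirming that the ES pods are up and the service has endpoints. A minimal sketch, assuming the usual `component=elasticsearch` label and `elasticsearch` service name:

    # Assumed label selector and service name
    oc -n openshift-logging get pods -l component=elasticsearch
    oc -n openshift-logging get endpoints elasticsearch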

Comment 2 Jeff Cantrill 2020-10-08 21:06:52 UTC
Please point us to a cluster that we can investigate.

Comment 3 Jeff Cantrill 2020-10-20 15:55:11 UTC
Please provide a must-gather should you see this issue again: https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
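
For reference, a must-gather can be collected along the lines of the following sketch; the exact image to pass is documented in the linked repository, and the value below is only a placeholder:

    oc adm must-gather --image=<cluster-logging-must-gather-image>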

Comment 4 Jeff Cantrill 2020-10-23 15:19:51 UTC
Setting UpcomingSprint as we are unable to resolve this before EOD.

Comment 8 Jeff Cantrill 2020-11-05 20:40:23 UTC
We have some general issues with certificate (re)generation that should be resolved by [1]. It has been observed that the CLO sometimes regenerates certs prematurely.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1888958

Comment 9 Matthew Robson 2020-11-06 14:16:41 UTC
It looks like ES is running with older certs, while fluentd (which restarts during the DaemonSet rollout) and Kibana (whose CR is recreated when the EO is upgraded) are using newer ones.

To solve this:

1) 
Make sure ES is actually green: oc exec $ES_POD -c elasticsearch -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/health?v

Edit EO and set it to unmanaged
oc edit Elasticsearch
Set managementState: Managed to managementState: Unmanaged
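
A non-interactive equivalent, as a sketch (the CR kind/name `elasticsearch` and the `component=elasticsearch` label are assumptions that may differ on your cluster):

    # Pick one ES pod to use for the health checks below
    ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n 1)
    # Set the Elasticsearch CR to Unmanaged without opening an editor
    oc -n openshift-logging patch elasticsearch elasticsearch --type merge -p '{"spec":{"managementState":"Unmanaged"}}'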

2) 
Scale down ES to 0
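
For example, something like the following sketch (assuming the ES deployments carry a `cluster-name=elasticsearch` label; adjust the selector or list the cdm deployment names explicitly):

    for d in $(oc -n openshift-logging get deployments -l cluster-name=elasticsearch -o name); do
        oc -n openshift-logging scale "$d" --replicas=0
    done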

3) 
Delete the elasticsearch secret in openshift-logging and wait for regenerate
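
For example (assuming the secret is named `elasticsearch` and is recreated by the operator once deleted):

    oc -n openshift-logging delete secret elasticsearch
    # Watch until the operator recreates the secret
    oc -n openshift-logging get secret elasticsearch -w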


4)
Scale ES back up
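
For example, reversing the scale-down sketch from step 2 (each cdm deployment normally runs a single replica; adjust if yours differ):

    for d in $(oc -n openshift-logging get deployments -l cluster-name=elasticsearch -o name); do
        oc -n openshift-logging scale "$d" --replicas=1
    done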

5)
Watch for it to go green: oc exec $ES_POD -c elasticsearch -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/health?v

6)
Delete the stuck fluentd pod which will allow the rest of the DaemonSet to continue the rollout correctly. 
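
For example (the pod name is a placeholder; the `component=fluentd` label is an assumption):

    oc -n openshift-logging get pods -l component=fluentd
    oc -n openshift-logging delete pod <stuck-fluentd-pod>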

7)
Delete the Kibana pod to allow it to reconnect
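
For example (assuming the Kibana pod carries the `component=kibana` label):

    oc -n openshift-logging delete pod -l component=kibana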

8)
If you're doing an upgrade, wait for the upgrade to finish:
Ex:
    nodes:
    - deploymentName: elasticsearch-cdm-9aeuzkeh-1
      upgradeStatus:
        upgradePhase: controllerUpdated
    - deploymentName: elasticsearch-cdm-9aeuzkeh-2
      upgradeStatus:
        upgradePhase: controllerUpdated
    - deploymentName: elasticsearch-cdm-9aeuzkeh-3
      upgradeStatus:
        upgradePhase: controllerUpdated
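
The node status above lives in the Elasticsearch CR; a quick way to watch it, assuming the CR is named `elasticsearch`:

    oc -n openshift-logging get elasticsearch elasticsearch -o jsonpath='{.status.nodes}'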

9)
Set the EO back to Managed
oc edit Elasticsearch
Set managementState: Unmanaged to managementState: Managed

This will cause ES to re-rollout one final time.
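
Or, non-interactively (same assumptions as the sketch in step 1):

    oc -n openshift-logging patch elasticsearch elasticsearch --type merge -p '{"spec":{"managementState":"Managed"}}'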

10)
Validate logs are being ingested. Depending on how long this issue has lasted, there may be a lot of logs to process, so ES will be under heavy load until things catch up.
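
One way to confirm ingestion is to watch document counts grow across the indices, reusing the same curl invocation as the health check:

    oc exec $ES_POD -c elasticsearch -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v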

Comment 10 ewolinet 2020-11-09 17:29:42 UTC
(In reply to Matthew Robson from comment #9)
> It looks like ES is running with older certs, while fluentd (which restarts
> during the DaemonSet rollout) and Kibana (whose CR is recreated when the EO
> is upgraded) are using newer ones.
> 
> To solve this:
> 
> 1) 
> Make sure ES is actually green: oc exec $ES_POD -c elasticsearch -- curl -s
> -k --cert /etc/elasticsearch/secret/admin-cert --key
> /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/health?v
> 
> Edit EO and set it to unmanaged
> oc edit Elasticsearch
> Set managementState: Managed to managementState: Unmanaged
> 
> 2) 
> Scale down ES to 0
> 
> 3) 
> Delete the elasticsearch secret in openshift-logging and wait for regenerate
> 
> 
> 4)
> Scale ES back up
> 
> 5)
> Watch for it to go green: oc exec $ES_POD -c elasticsearch -- curl -s -k
> --cert /etc/elasticsearch/secret/admin-cert --key
> /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/health?v
> 
> 6)
> Delete the stuck fluentd pod which will allow the rest of the DaemonSet to
> continue the rollout correctly. 
> 
> 7)
> Delete the Kibana pod to allow it to reconnect
> 
> 8)
> If you're doing an upgrade, wait for the upgrade to finish:
> Ex:
>     nodes:
>     - deploymentName: elasticsearch-cdm-9aeuzkeh-1
>       upgradeStatus:
>         upgradePhase: controllerUpdated
>     - deploymentName: elasticsearch-cdm-9aeuzkeh-2
>       upgradeStatus:
>         upgradePhase: controllerUpdated
>     - deploymentName: elasticsearch-cdm-9aeuzkeh-3
>       upgradeStatus:
>         upgradePhase: controllerUpdated
> 
> 9)
> Set the EO back to Managed
> oc edit Elasticsearch
> Set managementState: Unmanaged to managementState: Managed
> 
> This will cause ES to re-rollout one final time.
> 
> 10)
> Validate logs are being ingested. Depending on how long this issue has
> lasted, there may be a lot of logs to process, so ES will be under heavy
> load until things catch up.

The EO should determine that the ES secret is different from what the current ES nodes are using and trigger a restart of the cluster.
It will schedule a full cluster restart in this case and then iterate through it. The elasticsearch CR status should indicate that the nodes are scheduled for a restart... was this not observed?
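
For reference, whether a restart has been scheduled should be visible in the Elasticsearch CR status, e.g. (assuming the CR is named `elasticsearch`):

    oc -n openshift-logging get elasticsearch elasticsearch -o jsonpath='{.status.nodes[*].upgradeStatus}'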

Comment 11 Jeff Cantrill 2020-11-19 19:47:58 UTC
Moving this to ON_QA to verify it is resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1888958

Comment 12 Anping Li 2020-12-03 12:21:49 UTC
Couldn't reproduce. Verified on clusterlogging.4.7.0-202012021027.p0

Comment 18 errata-xmlrpc 2021-02-24 11:21:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652