Service Delivery is seeing fluentd pods that are unable to start due to failures in the fluentd-init container.

At first we saw this error in just one fluentd pod on the cluster (the other fluentd pods were Running):

$ oc logs fluentd-g4rgk -c fluentd-init
/opt/rh/rh-ruby25/root/usr/share/ruby/net/protocol.rb:44:in `connect_nonblock': SSL_connect returned=1 errno=0 state=error: certificate verify failed (unable to get local issuer certificate) (OpenSSL::SSL::SSLError)
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/protocol.rb:44:in `ssl_socket_connect'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:985:in `connect'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:920:in `do_start'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:909:in `start'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:609:in `start'
	from wait_for_es_version.rb:26:in `<main>'

There have been a number of theories presented about the cause or resolution of this issue, including:
- the cluster logging instance being created in a namespace other than the `openshift-logging` namespace causing this issue
- bouncing the operator pod resolving this issue
- deleting the `master-certs` secret resolving this issue

However, none of those seem to apply here. After we deleted the `master-certs` secret and bounced the fluentd pods, we are now seeing the following behavior in ALL fluentd pods on the cluster:
- fluentd pods go from Init -> InitError -> CrashLoopBackOff and then back to Init
- all fluentd pods show these errors:

$ oc logs -n openshift-logging fluentd-pdm47 -c fluentd-init
/opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:937:in `initialize': execution expired (Net::OpenTimeout)
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:937:in `open'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:937:in `block in connect'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/timeout.rb:103:in `timeout'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:935:in `connect'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:920:in `do_start'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:909:in `start'
	from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:609:in `start'
	from wait_for_es_version.rb:26:in `<main>'
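For anyone checking other clusters, something like the following should enumerate the fluentd pods and pull the init-container output from each of them (this assumes the DaemonSet pods carry the usual `component=fluentd` label in `openshift-logging`; adjust the selector if your labels differ):

$ oc get pods -n openshift-logging -l component=fluentd
$ for p in $(oc get pods -n openshift-logging -l component=fluentd -o name); do
    echo "== $p =="
    oc logs -n openshift-logging "$p" -c fluentd-init --tail=20
  done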
Please point us to a cluster that we can investigate.
Please provide a must-gather should you see this issue again: https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
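The logging-specific must-gather is usually run against the cluster-logging-operator image; roughly along these lines (a sketch only - the exact command is documented in the repo above, and pulling `containers[0].image` from the `cluster-logging-operator` deployment is an assumption about that deployment's layout):

$ oc adm must-gather --image=$(oc -n openshift-logging get deployment cluster-logging-operator -o jsonpath='{.spec.template.spec.containers[0].image}')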
Setting UpcomingSprint as unable to resolve before EOD
We have some general issues with certificate (re)generation that should be resolved by [1]. It has been observed that the CLO sometimes regenerates certs prematurely.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1888958
Looks like ES is running with older certs, while fluentd (when it restarts during the daemonset rollout) and kibana (when its CR is recreated as the EO is upgraded) are using newer ones.

To solve this:

1) Make sure ES is actually green:
oc exec $ES_POD -c elasticsearch -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/health?v

Edit the EO and set it to unmanaged:
oc edit Elasticsearch
Change managementState: Managed to managementState: Unmanaged

2) Scale down ES to 0 (a command-level sketch of steps 2-4 follows this procedure)

3) Delete the elasticsearch secret in openshift-logging and wait for it to regenerate

4) Scale ES back up

5) Watch for it to go green:
oc exec $ES_POD -c elasticsearch -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/health?v

6) Delete the stuck fluentd pod, which will allow the rest of the DaemonSet to continue the rollout correctly.

7) Delete the Kibana pod to allow it to reconnect.

8) If you're doing an upgrade, wait for the upgrade to finish. Ex:
nodes:
- deploymentName: elasticsearch-cdm-9aeuzkeh-1
  upgradeStatus:
    upgradePhase: controllerUpdated
- deploymentName: elasticsearch-cdm-9aeuzkeh-2
  upgradeStatus:
    upgradePhase: controllerUpdated
- deploymentName: elasticsearch-cdm-9aeuzkeh-3
  upgradeStatus:
    upgradePhase: controllerUpdated

9) Set the EO back to Managed:
oc edit Elasticsearch
Change managementState: Unmanaged to managementState: Managed

This will cause ES to re-rollout one final time.

10) Validate that logs are being ingested. Depending on how long this issue has lasted, there may be a lot of logs to be processed, so ES will be under heavy load until things catch up.
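As referenced in step 2, here is a rough sketch of steps 2-4 as shell commands (it assumes the ES deployments follow the usual elasticsearch-cdm-* naming, that each cdm deployment normally runs a single replica, and that the regenerated secret is simply named `elasticsearch`; adjust to your cluster):

# step 2: scale each ES deployment down to 0
$ for d in $(oc -n openshift-logging get deployments -o name | grep elasticsearch-cdm); do
    oc -n openshift-logging scale "$d" --replicas=0
  done

# step 3: delete the secret and watch for it to be recreated
$ oc -n openshift-logging delete secret elasticsearch
$ oc -n openshift-logging get secret elasticsearch -w

# step 4: scale the deployments back up
$ for d in $(oc -n openshift-logging get deployments -o name | grep elasticsearch-cdm); do
    oc -n openshift-logging scale "$d" --replicas=1
  done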
(In reply to Matthew Robson from comment #9)
> [procedure from comment #9 quoted above - snipped]

The EO should be determining that the ES secret is different from what the current ES nodes are using, and it should trigger a restart of the cluster. It will schedule a full cluster restart in this case and then iterate through it. The elasticsearch CR status should indicate that the nodes are scheduled for a restart... was this not observed?
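If it helps to check, that state should be visible in the CR status; something along these lines should surface it (field names follow the status excerpt in comment #9, so the exact paths may vary by version):

$ oc -n openshift-logging get elasticsearch -o yaml | grep -A 3 'upgradeStatus'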
Moving this to ON_QA to verify it is resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1888958
Couldn't reproduce. Verified on clusterlogging.4.7.0-202012021027.p0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0652