Description of problem:

The following situation was seen in a customer deployment:

~~~
sh-4.2$ ls -l /tmp/ocp-clo/
total 76
-rw-r--r--. 1 1000600000 root 1850 Oct 16 15:00 ca.crt
-rw-r--r--. 1 1000600000 root    0 Oct 15 08:39 ca.db
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 ca.key
-rw-r--r--. 1 1000600000 root    3 Oct 15 08:39 ca.serial.txt
-rw-r--r--. 1 1000600000 root 2411 Oct 16 15:00 elasticsearch.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 elasticsearch.key
-rw-r--r--. 1 1000600000 root 1956 Oct 16 15:00 kibana-internal.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 kibana-internal.key
-rw-r--r--. 1 1000600000 root   32 Oct 16 15:00 kibana-session-secret
-rw-r--r--. 1 1000600000 root 2171 Oct 16 15:00 logging-es.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 logging-es.key
-rw-r--r--. 1 1000600000 root 4256 Oct 16 15:00 signing.conf
-rw-r--r--. 1 1000600000 root 1923 Oct 16 15:00 system.admin.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 system.admin.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:00 system.logging.curator.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 system.logging.curator.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:00 system.logging.fluentd.crt
-rw-r--r--. 1 1000600000 root    0 Oct 16 15:00 system.logging.fluentd.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:00 system.logging.kibana.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 system.logging.kibana.key
~~~

Notice that system.logging.fluentd.key is empty.
Since the script https://github.com/openshift/cluster-logging-operator/blob/master/scripts/cert_generation.sh only regenerates when the .crt is missing or expired, this case is never handled, and the customer sees the key missing from the fluentd secret along with this error:

~~~
oc logs -f fluentd-xxxxx -c fluentd-init -n openshift-logging
wait_for_es_version.rb:23:in `initialize': Neither PUB key nor PRIV key: header too long (OpenSSL::PKey::RSAError)
	from wait_for_es_version.rb:23:in `new'
	from wait_for_es_version.rb:23:in `<main>'
~~~

Version-Release number of selected component (if applicable):
4.5

The script today checks:

~~~
if [ $REGENERATE_NEEDED = 1 ] || [ ! -f ${WORKING_DIR}/${component}.crt ] || ! openssl x509 -checkend 0 -noout -in ${WORKING_DIR}/${component}.crt; then
    generate_cert_config $component $extensions
    generate_request $component
    sign_cert $component
  fi
}
~~~

So regeneration is triggered only if the .crt is missing or expired. I agree this case of an empty key is extremely rare, but perhaps we should handle it too: if a file is corrupted (a .crt not containing a valid certificate, or a .key file not containing a valid key), we could launch a regeneration.

How reproducible:
Very rarely

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
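One way the check above could be hardened, as suggested, is to also validate the key file. A minimal sketch, assuming openssl is on the PATH; the helper name `key_is_valid` is illustrative and not part of the operator's actual code:

```shell
# Illustrative helper: succeeds only if the file exists, is non-empty,
# and contains a parseable private key. An empty system.logging.fluentd.key
# would fail this check and could then trigger regeneration.
key_is_valid() {
    local keyfile=$1
    [ -s "$keyfile" ] && openssl pkey -noout -in "$keyfile" >/dev/null 2>&1
}

# The existing condition could then be extended along these lines
# (WORKING_DIR and component as in cert_generation.sh):
#   if [ $REGENERATE_NEEDED = 1 ] \
#       || [ ! -f ${WORKING_DIR}/${component}.crt ] \
#       || ! openssl x509 -checkend 0 -noout -in ${WORKING_DIR}/${component}.crt \
#       || ! key_is_valid ${WORKING_DIR}/${component}.key; then
#       ...
#   fi
```

Note that `openssl pkey -noout` only verifies that the key parses; it does not check that the key matches the certificate.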
This may be resolved in 4.7 with https://issues.redhat.com/browse/LOG-422
Hello,

As I was discussing with @German earlier, I was able to reproduce the issue. The files in the cluster-logging-operator working directory looked like this:

~~~
sh-4.2$ ls -lrt
total 80
-rw-r--r--. 1 1000600000 root    3 Oct 15 08:39 ca.serial.txt
-rw-r--r--. 1 1000600000 root    0 Oct 16 15:14 ca.db
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.fluentd.crt_backup
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.curator.crt
-rw-r--r--. 1 1000600000 root    0 Oct 16 15:48 system.logging.fluentd.key
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 system.logging.curator.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.fluentd.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 ca.key
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 elasticsearch.key
-rw-r--r--. 1 1000600000 root 2411 Oct 16 15:48 elasticsearch.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 logging-es.key
-rw-r--r--. 1 1000600000 root 2171 Oct 16 15:48 logging-es.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 system.admin.key
-rw-r--r--. 1 1000600000 root 1923 Oct 16 15:48 system.admin.crt
-rw-r--r--. 1 1000600000 root 1850 Oct 16 15:48 ca.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 system.logging.kibana.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.kibana.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 kibana-internal.key
-rw-r--r--. 1 1000600000 root   32 Oct 16 15:48 kibana-session-secret
-rw-r--r--. 1 1000600000 root 1956 Oct 16 15:48 kibana-internal.crt
-rw-r--r--. 1 1000600000 root 4256 Oct 16 15:48 signing.conf
~~~

Here, both ca.db and system.logging.fluentd.key were empty. Because ca.db was empty, system.logging.fluentd.key was not regenerated.

I tried the following in my own lab:

~~~
$ oc -n openshift-logging rsh <operator pod>
$ cd /tmp/ocp-clo/
$ echo "" > system.logging.fluentd.key
~~~

After doing this, system.logging.fluentd.key is always regenerated along with the secret, and everything goes back to working.
The next step was to try to reach a configuration similar to the one described, where the issue happened with an empty ca.db:

~~~
$ oc -n openshift-logging rsh <operator pod>
$ cd /tmp/ocp-clo/
### Delete content from ca.db
$ echo "" > ca.db
### Delete content from system.logging.fluentd.key
$ echo "" > system.logging.fluentd.key
~~~

In this case, as expected, system.logging.fluentd.key is never regenerated. The next step was to fill ca.db with its content again (I knew what it was, although usually you will not), after which system.logging.fluentd.key was regenerated.

Then, reviewing the code [1], a possible workaround is to force the regeneration of the key:

~~~
if [ $REGENERATE_NEEDED = 1 ] || [ ! -f ${WORKING_DIR}/${component}.crt ] || ! openssl x509 -checkend 0 -noout -in ${WORKING_DIR}/${component}.crt;
~~~

Since this checks whether "${WORKING_DIR}/${component}.crt" exists, we opted to delete system.logging.fluentd.crt, and that triggered the regeneration of system.logging.fluentd.crt.

That said, I don't know how it was possible for the ca.db and system.logging.fluentd.key files to end up empty:
- Do you have any idea, @Jeff, how this situation could happen?

Thanks in advance,
Oscar

[1] https://github.com/openshift/cluster-logging-operator/blob/6f5ed87809a34a678637ee219e7b4d91cfb979a4/scripts/cert_generation.sh#L210
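To spot this kind of corruption without knowing in advance which file is affected, a quick diagnostic pass over the working directory can flag every .crt or .key that openssl cannot parse. This is a hedged sketch; the function name and output format are made up for illustration:

```shell
# Illustrative diagnostic: print every .crt that does not parse as a
# certificate and every .key that does not parse as a private key
# (empty files, such as the blanked system.logging.fluentd.key, are flagged).
check_certs_dir() {
    local dir=$1 f
    for f in "$dir"/*.crt; do
        [ -e "$f" ] || continue
        openssl x509 -noout -in "$f" 2>/dev/null || echo "invalid cert: $f"
    done
    for f in "$dir"/*.key; do
        [ -e "$f" ] || continue
        openssl pkey -noout -in "$f" 2>/dev/null || echo "invalid key: $f"
    done
}
```

Run inside the operator pod, e.g. `check_certs_dir /tmp/ocp-clo`, this should flag only the empty key in the listing above (ca.db is neither a .crt nor a .key, so it is not covered by this sketch).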
Hello,

I was not able to find a bug created for backporting this to OCP 4.5. Do we have one? Should I create a new one for it?

Regards,
Oscar
(In reply to Oscar Casal Sanchez from comment #11)
> Hello,
>
> I was not able to see a Bug created for being backported this to OCP 4.5. Do
> we have one? Should I create a new one for it?
>
> Regards,
> Oscar

@Oscar Casal Sanchez I suggest some patience here. The backport BZs are created when the parent BZ is VERIFIED. The OpenShift bots will create them automatically if, e.g., the 4.6 PR is marked with "/cherry-pick release-4.5".
Verified on clusterlogging.4.7.0-202011071430.p0.

1. Set clusterlogging/instance to Unmanaged.
2. Set fluentd.key to blank:
   oc edit secret fluentd -n logging -> data.tls.keys: ""
3. Delete one fluentd pod. The new pod goes into Error or Init:CrashLoopBackOff, and /etc/fluent/keys/tls.key in the fluentd pods is blank.
   Note: the other fluentd pods stay in Running status even though /etc/fluent/keys/tls.key is blank. The reason is that the fluentd processes keep using the old secret until they are restarted and load the new one.
4. Set clusterlogging/instance back to Managed and wait for a while.
5. Check the fluentd pod status. All fluentd pods are running.
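The note in step 3 (running pods surviving the blanked key until restart) can be illustrated outside the cluster: a process that reads the key at startup keeps its in-memory copy no matter what later happens to the file on disk. A toy sketch, with arbitrary file names, simulating that behavior:

```shell
# Toy illustration of the behavior noted above: the key is read once at
# "startup"; truncating the on-disk file afterwards does not touch the copy
# already held in memory, so the process keeps working until it restarts.
demo_key=$(mktemp)
openssl genpkey -algorithm RSA -out "$demo_key" 2>/dev/null
key_in_memory=$(cat "$demo_key")   # simulates fluentd loading the key at start
: > "$demo_key"                    # simulates the secret update blanking the file
[ -s "$demo_key" ] || echo "on-disk key is now empty"
[ -n "$key_in_memory" ] && echo "in-memory copy is still intact"
```

This is only an analogy; in the cluster the kubelet also takes some time to propagate secret updates into the mounted volume.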
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0652