Bug 1888958 - certificates are regenerated only when crt is missing.
Summary: certificates are regenerated only when crt is missing.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard: logging-core
Depends On:
Blocks: 1895607
 
Reported: 2020-10-16 15:12 UTC by German Parente
Modified: 2024-03-25 16:44 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1895607 (view as bug list)
Environment:
Last Closed: 2021-02-24 11:21:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 765 0 None closed Bug 1888958: Store secrets in one place and utilize mutex 2021-02-16 15:23:16 UTC
Red Hat Knowledge Base (Solution) 5495561 0 None None None 2020-10-17 11:09:03 UTC
Red Hat Product Errata RHBA-2021:0652 0 None None None 2021-02-24 11:22:11 UTC

Description German Parente 2020-10-16 15:12:39 UTC
Description of problem:

The following situation was observed in a customer deployment:


sh-4.2$ ls -l /tmp/ocp-clo/
total 76
-rw-r--r--. 1 1000600000 root 1850 Oct 16 15:00 ca.crt
-rw-r--r--. 1 1000600000 root    0 Oct 15 08:39 ca.db
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 ca.key
-rw-r--r--. 1 1000600000 root    3 Oct 15 08:39 ca.serial.txt
-rw-r--r--. 1 1000600000 root 2411 Oct 16 15:00 elasticsearch.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 elasticsearch.key
-rw-r--r--. 1 1000600000 root 1956 Oct 16 15:00 kibana-internal.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 kibana-internal.key
-rw-r--r--. 1 1000600000 root   32 Oct 16 15:00 kibana-session-secret
-rw-r--r--. 1 1000600000 root 2171 Oct 16 15:00 logging-es.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 logging-es.key
-rw-r--r--. 1 1000600000 root 4256 Oct 16 15:00 signing.conf
-rw-r--r--. 1 1000600000 root 1923 Oct 16 15:00 system.admin.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 system.admin.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:00 system.logging.curator.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 system.logging.curator.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:00 system.logging.fluentd.crt
-rw-r--r--. 1 1000600000 root    0 Oct 16 15:00 system.logging.fluentd.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:00 system.logging.kibana.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:00 system.logging.kibana.key

Notice that system.logging.fluentd.key is empty (0 bytes).

Because the script

https://github.com/openshift/cluster-logging-operator/blob/master/scripts/cert_generation.sh

only regenerates certificates when the .crt file does not exist (or fails the expiry check), this case is never handled; the customer ends up with the key missing from the fluentd secret and sees this error:

oc logs -f fluentd-xxxxx -c fluentd-init -n openshift-logging
wait_for_es_version.rb:23:in `initialize': Neither PUB key nor PRIV key: header too long (OpenSSL::PKey::RSAError)
        from wait_for_es_version.rb:23:in `new'
        from wait_for_es_version.rb:23:in `<main>'

Version-Release number of selected component (if applicable): 4.5

The script currently checks:


  if [ $REGENERATE_NEEDED = 1 ] || [ ! -f ${WORKING_DIR}/${component}.crt ] || ! openssl x509 -checkend 0 -noout -in ${WORKING_DIR}/${component}.crt; then
    generate_cert_config $component $extensions
    generate_request $component
    sign_cert $component
  fi
}

So regeneration is triggered only when the .crt file is missing or expired; the contents of the .key file are never checked.

I agree that this case of an empty key is extremely rare, but perhaps we should account for it: if a file is corrupted (i.e. the .crt does not contain a valid certificate, or the .key does not contain a valid key), a regeneration could be triggered.
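A minimal sketch of what such a check could look like, assuming the same WORKING_DIR/component variables and helper functions used in cert_generation.sh; the added key validation (non-empty file plus `openssl rsa -check`) is only an illustration of the idea, not the fix that was actually merged:

~~~
  # Regenerate when the cert is missing/expired OR the key is empty/invalid.
  if [ $REGENERATE_NEEDED = 1 ] \
      || [ ! -f ${WORKING_DIR}/${component}.crt ] \
      || ! openssl x509 -checkend 0 -noout -in ${WORKING_DIR}/${component}.crt \
      || [ ! -s ${WORKING_DIR}/${component}.key ] \
      || ! openssl rsa -check -noout -in ${WORKING_DIR}/${component}.key 2>/dev/null; then
    generate_cert_config $component $extensions
    generate_request $component
    sign_cert $component
  fi
~~~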



How reproducible: very rarely


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Jeff Cantrill 2020-10-19 16:05:34 UTC
This may be resolved in 4.7 with https://issues.redhat.com/browse/LOG-422

Comment 3 Oscar Casal Sanchez 2020-10-19 16:33:43 UTC
Hello,

As I mentioned to @German earlier, I was able to reproduce the issue. The files in the cluster-logging-operator pod looked like this:

~~~
sh-4.2$ ls -lrt
total 80
-rw-r--r--. 1 1000600000 root    3 Oct 15 08:39 ca.serial.txt
-rw-r--r--. 1 1000600000 root    0 Oct 16 15:14 ca.db
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.fluentd.crt_backup
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.curator.crt
-rw-r--r--. 1 1000600000 root    0 Oct 16 15:48 system.logging.fluentd.key
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 system.logging.curator.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.fluentd.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 ca.key
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 elasticsearch.key
-rw-r--r--. 1 1000600000 root 2411 Oct 16 15:48 elasticsearch.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 logging-es.key
-rw-r--r--. 1 1000600000 root 2171 Oct 16 15:48 logging-es.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 system.admin.key
-rw-r--r--. 1 1000600000 root 1923 Oct 16 15:48 system.admin.crt
-rw-r--r--. 1 1000600000 root 1850 Oct 16 15:48 ca.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 system.logging.kibana.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.kibana.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 kibana-internal.key
-rw-r--r--. 1 1000600000 root   32 Oct 16 15:48 kibana-session-secret
-rw-r--r--. 1 1000600000 root 1956 Oct 16 15:48 kibana-internal.crt
-rw-r--r--. 1 1000600000 root 4256 Oct 16 15:48 signing.conf
~~~

Here, both ca.db and system.logging.fluentd.key were empty. Because ca.db was empty, system.logging.fluentd.key was not regenerated.


I tried the following in my own lab:
~~~
$ oc -n openshift-logging rsh <operator pod> 
$ cd /tmp/ocp-clo/
$ echo "" > system.logging.fluentd.key
~~~

After doing this, system.logging.fluentd.key is always regenerated, the secret is updated, and everything starts working again.

The next step was to reproduce a configuration similar to the one described above, where the issue happened with an empty ca.db:
~~~
$ oc -n openshift-logging rsh <operator pod> 
$ cd /tmp/ocp-clo/
### Delete content from ca.db
$ echo "" > ca.db
### Delete content from system.logging.fluentd.key
$ echo "" > system.logging.fluentd.key
~~~

In this case, as expected, system.logging.fluentd.key is never regenerated. The next step was to restore the original content of ca.db (I knew it, but normally you would not), after which system.logging.fluentd.key was regenerated.

Then, reviewing the code [1], a possible workaround is to force regeneration of the key.

~~~
  if [ $REGENERATE_NEEDED = 1 ] || [ ! -f ${WORKING_DIR}/${component}.crt ] || ! openssl x509 -checkend 0 -noout -in ${WORKING_DIR}/${component}.crt;
~~~

Since the condition only checks whether "${WORKING_DIR}/${component}.crt" exists, we opted to delete system.logging.fluentd.crt, and system.logging.fluentd.crt was regenerated.
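For reference, a rough sketch of that workaround, assuming the operator keeps its working directory at /tmp/ocp-clo as shown above (the pod name is a placeholder):

~~~
$ oc -n openshift-logging rsh <operator pod>
$ cd /tmp/ocp-clo/
### Remove the certificate so the existing [ ! -f ${WORKING_DIR}/${component}.crt ] check fails
$ rm system.logging.fluentd.crt
### On the next reconciliation, cert_generation.sh regenerates the certificate
~~~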


That said, I don't know how the ca.db and system.logging.fluentd.key files could have ended up empty:

- @Jeff, do you have any idea how this situation could happen?


Thanks in advance,
Oscar

[1] https://github.com/openshift/cluster-logging-operator/blob/6f5ed87809a34a678637ee219e7b4d91cfb979a4/scripts/cert_generation.sh#L210

Comment 11 Oscar Casal Sanchez 2020-11-09 11:04:53 UTC
Hello,

I was not able to find a bug created for backporting this to OCP 4.5. Do we have one? Should I create a new one for it?

Regards,
Oscar

Comment 12 Periklis Tsirakidis 2020-11-09 12:29:08 UTC
(In reply to Oscar Casal Sanchez from comment #11)
> Hello,
> 
> I was not able to find a bug created for backporting this to OCP 4.5. Do
> we have one? Should I create a new one for it?
> 
> Regards,
> Oscar

@Oscar Casal Sanchez

I suggest some patience here. The backport BZs are created when their parent BZ is VERIFIED. The OpenShift bots will create them automatically if, e.g., the 4.6 PR is marked with "/cherry-pick release-4.5".

Comment 13 Anping Li 2020-11-10 06:31:56 UTC
Verified on clusterlogging.4.7.0-202011071430.p0.
1. Set clusterlogging/instance to Unmanaged.
2. Set fluentd.key to blank
   oc edit secret fluentd -n logging ->data.tls.keys: ""
3. Delete one fluentd pod. The new pod will be in an error or Init:CrashLoopBackOff state, and /etc/fluent/keys/tls.key in the fluentd pods is blank.
   Note: The other fluentd pods remain in Running status even though /etc/fluent/keys/tls.key is blank, because the fluentd processes keep using the old secret until they are restarted and load the new one.

4. Set clusterlogging/instance back to Managed and wait for a while.
5. Check the fluentd pods status.
   All fluentd pods are running. (A rough command sketch of these steps follows below.)
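A sketch of those steps as commands, assuming the ClusterLogging instance and fluentd secret live in the openshift-logging namespace, that the secret data key is tls.key, and that the collector pods carry the component=fluentd label (these names are assumptions, not taken from this bug):

~~~
### 1. Stop the operator from reconciling
$ oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
### 2. Blank the key in the fluentd secret (data key name assumed)
$ oc -n openshift-logging patch secret fluentd --type merge -p '{"data":{"tls.key":""}}'
### 3. Delete one fluentd pod and watch the new one fail to initialize
$ oc -n openshift-logging delete pod <one fluentd pod>
### 4. Hand control back to the operator so it regenerates the secret
$ oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Managed"}}'
### 5. All fluentd pods should return to Running (label assumed)
$ oc -n openshift-logging get pods -l component=fluentd
~~~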

Comment 18 errata-xmlrpc 2021-02-24 11:21:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652

