Bug 1888958
Summary: certificates are regenerated only when crt is missing.

Product: OpenShift Container Platform
Component: Logging
Version: 4.5
Reporter: German Parente <gparente>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Anping Li <anli>
CC: aos-bugs, dahernan, jcantril, jeder, ocasalsa, periklis, qitang, rrackow, rsandu, syedriko
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: ServiceDeliveryImpact
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-core
Doc Type: No Doc Update
Type: Bug
Regression: ---
Clones: 1895607 (view as bug list)
Bug Blocks: 1895607
Last Closed: 2021-02-24 11:21:19 UTC
Description
German Parente
2020-10-16 15:12:39 UTC
This may be resolved in 4.7 with https://issues.redhat.com/browse/LOG-422

Hello,

I was commenting with @German earlier that I was able to reproduce the issue. The files in the openshift logging operator pod looked like this:

~~~
sh-4.2$ ls -lrt
total 80
-rw-r--r--. 1 1000600000 root    3 Oct 15 08:39 ca.serial.txt
-rw-r--r--. 1 1000600000 root    0 Oct 16 15:14 ca.db
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.fluentd.crt_backup
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.curator.crt
-rw-r--r--. 1 1000600000 root    0 Oct 16 15:48 system.logging.fluentd.key
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 system.logging.curator.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.fluentd.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 ca.key
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 elasticsearch.key
-rw-r--r--. 1 1000600000 root 2411 Oct 16 15:48 elasticsearch.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 logging-es.key
-rw-r--r--. 1 1000600000 root 2171 Oct 16 15:48 logging-es.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 system.admin.key
-rw-r--r--. 1 1000600000 root 1923 Oct 16 15:48 system.admin.crt
-rw-r--r--. 1 1000600000 root 1850 Oct 16 15:48 ca.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 system.logging.kibana.key
-rw-r--r--. 1 1000600000 root 1935 Oct 16 15:48 system.logging.kibana.crt
-rw-r--r--. 1 1000600000 root 3272 Oct 16 15:48 kibana-internal.key
-rw-r--r--. 1 1000600000 root   32 Oct 16 15:48 kibana-session-secret
-rw-r--r--. 1 1000600000 root 1956 Oct 16 15:48 kibana-internal.crt
-rw-r--r--. 1 1000600000 root 4256 Oct 16 15:48 signing.conf
~~~

Both ca.db and system.logging.fluentd.key were empty, and because ca.db was empty, system.logging.fluentd.key was not regenerated.

I tried the following in my own lab:

~~~
$ oc -n openshift-logging rsh <operator pod>
$ cd /tmp/ocp-clo/
$ echo "" > system.logging.fluentd.key
~~~

With only the key emptied, system.logging.fluentd.key is always regenerated, the secret is updated, and everything goes back to working.

The next step was to reproduce a configuration similar to the one where the issue happened, with ca.db empty as well:

~~~
$ oc -n openshift-logging rsh <operator pod>
$ cd /tmp/ocp-clo/
### Delete content from ca.db
$ echo "" > ca.db
### Delete content from system.logging.fluentd.key
$ echo "" > system.logging.fluentd.key
~~~

In this case, as expected, system.logging.fluentd.key is never regenerated. I then restored the original content of ca.db (I knew it in my lab, but in a real cluster you usually will not), and system.logging.fluentd.key was regenerated again.

Reviewing the code [1], a possible workaround is to force regeneration of the certificate. The relevant check is:

~~~
if [ $REGENERATE_NEEDED = 1 ] || [ ! -f ${WORKING_DIR}/${component}.crt ] || ! openssl x509 -checkend 0 -noout -in ${WORKING_DIR}/${component}.crt;
~~~

Since it only checks whether "${WORKING_DIR}/${component}.crt" exists, we opted to delete system.logging.fluentd.crt, and that forced system.logging.fluentd.crt to be regenerated (a sketch of how the check could be extended follows below).

What I still do not know is how the ca.db and system.logging.fluentd.key files ended up empty in the first place.

- @Jeff, do you have any idea how this situation could occur?

Thanks in advance,
Oscar

[1] https://github.com/openshift/cluster-logging-operator/blob/6f5ed87809a34a678637ee219e7b4d91cfb979a4/scripts/cert_generation.sh#L210
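For illustration only, a minimal sketch of how that check in cert_generation.sh could also treat a missing or empty key file as a trigger for regeneration. The added `-s` test, the `then ... fi` framing, and the placeholder body are assumptions layered on top of the quoted line, not the fix that actually shipped (that work is tracked in LOG-422):

~~~
# Sketch only, not the shipped fix: also regenerate when the key file is
# missing or zero-length. "-s" is true only for files that exist and are non-empty.
if [ "${REGENERATE_NEEDED}" = 1 ] || \
   [ ! -f "${WORKING_DIR}/${component}.crt" ] || \
   [ ! -s "${WORKING_DIR}/${component}.key" ] || \
   ! openssl x509 -checkend 0 -noout -in "${WORKING_DIR}/${component}.crt"; then
    # ... existing regeneration logic for ${component} would run here ...
    :
fi
~~~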
Hello,

I was not able to see a bug created for backporting this to OCP 4.5. Do we have one, or should I create a new one for it?

Regards,
Oscar

(In reply to Oscar Casal Sanchez from comment #11)
> Hello,
>
> I was not able to see a Bug created for being backported this to OCP 4.5. Do
> we have one? Should I create a new one for it?
>
> Regards,
> Oscar

@Oscar Casal Sanchez I suggest some patience here. The backport BZs are created when the parent BZ is VERIFIED. The OpenShift bots create them automatically if, for example, the 4.6 PR is marked with "/cherry-pick release-4.5".

Verified on clusterlogging.4.7.0-202011071430.p0:

1. Set clusterlogging/instance to Unmanaged.
2. Set the fluentd key to blank: oc edit secret fluentd -n logging and set data.tls.key to "".
3. Delete one fluentd pod. The new pod goes into Error or Init:CrashLoopBackOff, and /etc/fluent/keys/tls.key in the fluentd pods is blank.
   Note: the other fluentd pods remain in Running status even though /etc/fluent/keys/tls.key is blank, because fluentd keeps using the old secret until the process is restarted to load the new one.
4. Set clusterlogging/instance back to Managed and wait for a while (a command sketch for this toggle is included at the end of this report).
5. Check the fluentd pod status: all fluentd pods are running.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652
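For reference, a hedged sketch of the Managed/Unmanaged toggle used in steps 1 and 4 of the verification above. It assumes the default ClusterLogging resource name `instance` and the `openshift-logging` namespace, which are not stated explicitly in the verification comment:

~~~
# Step 1: stop the operator from reconciling the logging stack (Unmanaged)
oc -n openshift-logging patch clusterlogging instance --type merge \
  -p '{"spec":{"managementState":"Unmanaged"}}'

# ... blank the fluentd key and delete a fluentd pod as described above ...

# Step 4: hand control back to the operator (Managed) so it reconciles the certificates
oc -n openshift-logging patch clusterlogging instance --type merge \
  -p '{"spec":{"managementState":"Managed"}}'
~~~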