Bug 1730087

Summary: Regenerating logging certificates can fail with SEC_ERROR_REUSED_ISSUER_AND_SERIAL
Product: OpenShift Container Platform
Reporter: Matthew Robson <mrobson>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED CURRENTRELEASE
QA Contact: Anping Li <anli>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.9.0
CC: aos-bugs, dporter, ewolinet, jcantril, rmeggins
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-07-18 05:17:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Matthew Robson 2019-07-15 19:28:55 UTC
Description of problem:

I followed the steps to regenerate the logging certs in 3.9: https://docs.openshift.com/container-platform/3.9/install_config/aggregate_logging.html#fluentd-redeploy-certs

The Ansible playbook failed:

TASK [openshift_logging_elasticsearch : Getting ES version for logging-es cluster] ***
fatal: [myDRserver2001.myDomain.tld]: FAILED! => {"changed": true, "cmd": ["oc", "--config=/etc/origin/master/admin.kubeconfig", "exec", "logging-es-nh7hh97z-12-hpdlk", "-c", "elasticsearch", "-n", "logging", "--", "curl", "-s", "--cacert", "/etc/elasticsearch/secret/admin-ca", "--cert", "/etc/elasticsearch/secret/admin-cert", "--key", "/etc/elasticsearch/secret/admin-key", "-XGET", "https://localhost:9200/"], "delta": "0:00:00.646870", "end": "2019-07-15 08:47:36.275881", "failed": true, "msg": "non-zero return code", "rc": 35, "start": "2019-07-15 08:47:35.629011", "stderr": "command terminated with exit code 35", "stderr_lines": ["command terminated with exit code 35"], "stdout": "", "stdout_lines": []}

The actual issue is:

[ansible@myDRserver2001 ~]$ oc exec logging-es-nh7hh97z-12-hpdlk -c elasticsearch -n logging -- curl -vs --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XGET https://localhost:9200/
* About to connect() to localhost port 9200 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 9200 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/elasticsearch/secret/admin-ca
  CApath: none
* NSS error -8054 (SEC_ERROR_REUSED_ISSUER_AND_SERIAL)
* You are attempting to import a cert with the same issuer/serial as an existing cert, but that is not the same cert.
* Closing connection 0
command terminated with exit code 35
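
For readers unfamiliar with the return code in the Ansible output: `rc: 35` is curl's own exit status, passed through by `oc exec`. A minimal sketch of a lookup for a few exit codes documented in curl(1) follows. The function name `curl_rc_meaning` is hypothetical and not part of any tool.

```shell
# Hypothetical helper: translate a few curl exit codes (from curl(1),
# section EXIT CODES) into short descriptions.
curl_rc_meaning() {
  case "$1" in
    0)  echo "success" ;;
    7)  echo "failed to connect to host" ;;
    35) echo "SSL connect error (TLS handshake failed)" ;;
    60) echo "peer certificate cannot be authenticated with known CA certificates" ;;
    *)  echo "see the EXIT CODES section of curl(1)" ;;
  esac
}

curl_rc_meaning 35   # prints: SSL connect error (TLS handshake failed)
```

Exit code 35 only says that the handshake failed; the NSS error in the verbose output above is what identifies the cause.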

Looking at the current CAs and certs, the serials are all OK (no two of them match), so this isn't related to https://github.com/openshift/openshift-ansible/issues/7252

The issue seems to be that the old crt and the new crt have the same serial and the old crt is cached somewhere with nss causing curl to fail.
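
The condition described by the NSS message (same issuer and serial, but not the same certificate) can be sketched as a small comparison. This is only an illustration of the suspected collision, not how NSS implements the check. The function name and the fingerprint values are hypothetical placeholders; the issuer and serial are the ones reported in this bug.

```shell
# Hypothetical helper: succeeds (exit 0) when two certificates share issuer
# and serial but differ in content.
# Arguments: issuer1 serial1 fingerprint1 issuer2 serial2 fingerprint2
# The values for a PEM file can be read with:
#   openssl x509 -in <cert.pem> -noout -issuer -serial -fingerprint -sha256
reused_issuer_and_serial() {
  [ "$1" = "$4" ] && [ "$2" = "$5" ] && [ "$3" != "$6" ]
}

# Placeholder fingerprints (AA:AA, BB:BB) stand in for the old and new cert.
if reused_issuer_and_serial \
    "CN=logging-signer-test" "05" "AA:AA" \
    "CN=logging-signer-test" "05" "BB:BB"; then
  echo "conflict: same issuer and serial, different certificate"
fi
```

If the old certificate were still available, comparing its values against the new `/etc/elasticsearch/secret/admin-cert` this way would confirm or rule out the assumption.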

I cannot find where the old cert is stored; certutil shows an empty DB.

sh-4.2$ certutil -L -d sql:/etc/pki/nssdb

Certificate Nickname                                         Trust Attributes
                                                             SSL,S/MIME,JAR/XPI

Deleting the ES pod and allowing it to be recreated resolves the issue, as the old cert seems to be purged. However, because the playbook failed, everything else in the logging stack is out of sync.
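
The pod deletion described above could look roughly like the following. This is a sketch of what was observed to help in this one environment, not a verified fix. The `run` and `recreate_es_pod` helpers are hypothetical; by default they only print the commands, and they execute them only when `DRY_RUN=0` is set.

```shell
# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Delete the ES pod so that it is recreated, then list the pods in the
# namespace to find the name of the replacement pod.
recreate_es_pod() {
  pod="$1"
  ns="${2:-logging}"
  run oc delete pod "$pod" -n "$ns"
  run oc get pods -n "$ns"
}

# Pod name taken from this report; it will differ in other clusters.
recreate_es_pod logging-es-nh7hh97z-12-hpdlk
```

Once the replacement pod is running, the `curl -vs` check shown earlier can be repeated against the new pod name to see whether the NSS error is gone. This does not address the other logging components left out of sync by the failed playbook.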

Version-Release number of selected component (if applicable):

v3.9.51


How reproducible:

Random


Steps to Reproduce:

1) Deleted the certs from the masters

2) Reran logging playbook

3) CA and Certs were recreated - the CA looks fine - serial 1

4) The new cert for ES has the same issuer CN=logging-signer-test - this is expected

5) The new cert for ES has serial 5, which we assume is the same serial as the old certificate's, and thus it fails

Actual results:
Playbook fails and the logging stack is broken

Expected results:

Additional info:

Comment 11 Dirk Porter 2019-11-27 15:32:00 UTC
Hello, 

I have a customer with this issue in 3.9. Is there a workaround that was discovered for this? 

Regards, 

Dirk Porter