Bug 1810036 - "You are attempting to import a cert with the same issuer/serial as an existing cert, but that is not the same cert" after upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: service-ca
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Maru Newby
QA Contact: scheng
URL:
Whiteboard:
Depends On:
Blocks: 1810418
 
Reported: 2020-03-04 13:03 UTC by Junqi Zhao
Modified: 2023-09-07 22:11 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1810418 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:17:56 UTC
Target Upstream Version:
Embargoed:


Attachments
CA cert chain (3.92 KB, text/plain)
2020-03-04 16:31 UTC, Standa Laznicka


Links
System ID Private Priority Status Summary Last Updated
Github openshift library-go pull 726 0 None closed Bug 1810036: Set a random serial number for signing certificate templates 2021-02-17 13:05:10 UTC
Github openshift service-ca-operator pull 110 0 None closed Bug 1810036: Ensure service CA certs are created with unique serial numbers 2021-02-17 13:05:10 UTC
Red Hat Knowledge Base (Solution) 4915371 0 None None None 2020-03-24 08:46:57 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:18:17 UTC

Internal Links: 1814390

Description Junqi Zhao 2020-03-04 13:03:37 UTC
Description of problem:
On a 4.3.3 cluster with FIPS enabled, after upgrading to 4.4.0-0.nightly-2020-03-04-000622, curl against an https endpoint fails with the error "You are attempting to import a cert with the same issuer/serial as an existing cert, but that is not the same cert". An http endpoint does not have this issue, since it does not need a cert.

NOTE: It seems this issue does not affect the cluster functions
*********************************************
# oc -n openshift-apiserver get pod -o wide
NAME                       READY   STATUS    RESTARTS   AGE    IP            NODE                                       NOMINATED NODE   READINESS GATES
apiserver-88b6f947-7j4dz   1/1     Running   0          116m   10.128.0.8    qe-jia-nb8tg-m-1.c.openshift-qe.internal   <none>           <none>
apiserver-88b6f947-lp624   1/1     Running   0          116m   10.130.0.41   qe-jia-nb8tg-m-2.c.openshift-qe.internal   <none>           <none>
apiserver-88b6f947-z7wg6   1/1     Running   0          117m   10.129.0.29   qe-jia-nb8tg-m-0.c.openshift-qe.internal   <none>           <none>

# oc -n openshift-apiserver get ep
NAME   ENDPOINTS                                           AGE
api    10.128.0.8:8443,10.129.0.29:8443,10.130.0.41:8443   4h34m

# oc -n openshift-apiserver exec apiserver-88b6f947-7j4dz -- curl -k 'https://10.128.0.8:8443/metrics'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (35) You are attempting to import a cert with the same issuer/serial as an existing cert, but that is not the same cert.

FYI, the normal result on a FIPS-disabled cluster looks like the following, since the request needs an SA token:
*********************************
# oc -n openshift-apiserver exec  apiserver-5cfd58bdbd-m86n6 -- curl -k 'https://10.128.0.39:8443/metics'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   239  100   239    0     0   1531      0 --:--:-- --:--:-- --:--:--     0
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metics\"",
  "reason": "Forbidden",
  "details": {
    
  },
  "code": 403
}
*********************************


On the same FIPS-enabled cluster, http does not have the issue:

# oc -n openshift-cluster-version get pod -o wide
NAME                                        READY   STATUS    RESTARTS   AGE    IP         NODE                                       NOMINATED NODE   READINESS GATES
cluster-version-operator-77fbd67d45-fzcts   1/1     Running   0          119m   10.0.0.6   qe-jia-nb8tg-m-2.c.openshift-qe.internal   <none>           <none>

# oc -n openshift-cluster-version get ep
NAME                       ENDPOINTS       AGE
cluster-version-operator   10.0.0.6:9099   4h48m

# oc -n openshift-cluster-version exec cluster-version-operator-77fbd67d45-fzcts -- curl 'http://10.0.0.6:9099/metrics'
...
# TYPE cluster_installer gauge
cluster_installer{invoker="user",type="openshift-install",version="v4.3.3"} 1
# HELP cluster_operator_condition_transitions Reports the number of times that a condition on a cluster operator changes status
# TYPE cluster_operator_condition_transitions gauge
cluster_operator_condition_transitions{condition="Available",name="console"} 1
cluster_operator_condition_transitions{condition="Available",name="ingress"} 2
*********************************************


Version-Release number of selected component (if applicable):
4.3.3 cluster enabled FIPS, and upgrade to 4.4.0-0.nightly-2020-03-04-000622

How reproducible:
always

Steps to Reproduce:
1. See the description

Actual results:


Expected results:


Additional info:

Comment 2 Standa Laznicka 2020-03-04 16:31:49 UTC
Created attachment 1667540
CA cert chain

Comment 3 Junqi Zhao 2020-03-05 04:35:54 UTC
Confirmed: this issue has nothing to do with FIPS. It is a regression after upgrade.

Comment 4 Maru Newby 2020-03-05 08:27:30 UTC
Bumping to urgent. This needs to be backported as far back as 4.2 ASAP to avoid potentially impacting anyone upgrading to a rotation-supporting zstream release.

On the bright side, this problem doesn't appear in golang code so it's only non-golang code that is likely to be impacted.
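
As background for why only some clients fail: the error text in the description appears to come from NSS, which non-golang clients such as the curl build above may use for TLS. The sketch below is a toy model in Python, not NSS code, and every class and field name in it is hypothetical. It shows a certificate store that treats (issuer, serial) as a unique key and refuses a second, different certificate under the same key.

```python
from dataclasses import dataclass


class ReusedIssuerAndSerial(Exception):
    """Raised when a different cert reuses an existing (issuer, serial) pair."""


@dataclass(frozen=True)
class Cert:
    issuer: str
    serial: int
    fingerprint: str  # stands in for the full encoded certificate


class CertStore:
    """Toy store that treats (issuer, serial) as a unique key."""

    def __init__(self):
        self._by_key = {}

    def import_cert(self, cert: Cert) -> None:
        key = (cert.issuer, cert.serial)
        existing = self._by_key.get(key)
        # Re-importing the identical cert is fine; a different cert
        # under the same key is rejected.
        if existing is not None and existing.fingerprint != cert.fingerprint:
            raise ReusedIssuerAndSerial(key)
        self._by_key[key] = cert
```

In this model, importing two distinct certificates that share an issuer name and serial number raises `ReusedIssuerAndSerial`, while a client that does not index certificates this way would not notice the clash.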

Comment 8 Scott Dodson 2020-03-18 15:52:03 UTC
Copying the assessment from the downstream 4.3 bug 1810420 since this is the bug that's referenced in the blocked edges.

Who is impacted?
  All customers upgrading to 4.3.5 that run workloads which use service-ca for certs.
  All customers that install a fresh 4.3.5 cluster will be affected on rotation after 13 months if they don't upgrade.

What is the impact?
  All workloads with non-golang SSL network clients (e.g. curl) that use service-ca to communicate with the platform or with each other.

How involved is remediation?
  Manual rotation of service-ca will fix the cluster for the next rotation (13 months).

Is this a regression?
  Yes, this was introduced as part of the automated service-ca rotation and released on March 10 in 4.3.5.

> Can you please add the step for manual workaround. It would be useful for CEE


https://docs.openshift.com/container-platform/4.3/authentication/certificates/service-serving-certificate.html#manually-rotate-service-ca_service-serving-certificate

Comment 9 Ryan Howe 2020-03-18 20:40:32 UTC
Is this only an issue if you have FIPS enabled?
Can FIPS only be enabled if you install with 4.3 and enable it at install time?

So if you have FIPS enabled, the workaround is to rotate manually following these steps?:

https://docs.openshift.com/container-platform/4.3/authentication/certificates/service-serving-certificate.html#manually-rotate-service-ca_service-serving-certificate

Do we really want to point out a command that will delete every pod in the cluster...  (well, it will try; it will likely fail, causing the cluster to fall over for a time)

Comment 10 Maru Newby 2020-03-18 21:27:50 UTC
(In reply to Ryan Howe from comment #9)
> Is this only an issue if you have FIPs enabled 
> Can FIPS only be enabled if you install with 4.3 and enable it at install
> time?

This issue is not limited to a cluster with FIPS enabled. It affects any cluster that is upgraded to a release that enables automated service CA rotation without also ensuring unique CA serial numbers.

> So if you have fips enabled the workaround is to rotate manually following
> these steps?:
> 
> https://docs.openshift.com/container-platform/4.3/authentication/
> certificates/service-serving-certificate.html#manually-rotate-service-
> ca_service-serving-certificate
> 
> Do we really want to point out a command that will delete every pod in the
> cluster...  (well try, it will fail likely causing the cluster to fall over
> for a time being)

The provided link is the documented procedure for manual CA rotation. Manual deletion of all pods will disrupt all services in the cluster - including the control plane - but the cluster will recover. It's similar to the node drain that occurs on every upgrade.
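
The "unique CA serial numbers" requirement mentioned in this comment can be illustrated with a minimal sketch. The actual fix lives in Go (the library-go and service-ca-operator pull requests linked above) and is not reproduced here; the function name and the 63-bit width below are assumptions chosen for illustration only.

```python
import secrets


def random_serial(bits: int = 63) -> int:
    """Return a random positive serial number below 2**bits."""
    if bits < 8:
        raise ValueError("too few bits for a collision-resistant serial")
    # randbelow(n) is uniform over [0, n); add 1 so the serial is never zero.
    return secrets.randbelow(2**bits - 1) + 1
```

Drawing each CA's serial at random makes it very unlikely that two CA certificates with the same issuer name end up with the same serial, which is the clash behind the error in this bug.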

Comment 11 W. Trevor King 2020-03-19 17:03:16 UTC
Expanding on the "Who is impacted?" from comment 8, so we know which update recommendations to pull and which releases to tombstone:

The bugs were introduced by the bug 1774121 series, and fixed by the combination of this series and bug 1801573.  Quick overview:

* 4.4: both rc.0 and rc.1 affected, so updates into rc.0 and rc.1 are impacted (and running either RC for 13+ months will also hit a broken CA rotation).  Fixes have landed, so next 4.4 RC should be clean.
* 4.3: 4.3.5 introduced the breakage, so updates into 4.3.5 are impacted.  No fix yet.
* 4.2: 4.2.22 introduced the breakage, so updates into 4.2.22, 4.2.23, and 4.2.24 are impacted.  No fix yet.
* 4.1: not impacted yet.  Bug 1774157 was backporting the breaking change, and is still ASSIGNED.

Reasoning behind the overview's claims:

* 4.5: Introduced by bug 1774121 (no linked PR, so not sure exactly when it was introduced).  Fixed by bug 1810036, service-ca-operator 74b5ce2 [1], which included library-go d9c73bb [2].

  Also fixed by bug 1801573, oauth-proxy 3d0621e [3], which landed before the 4.4/4.5 split.

* 4.4: Introduced by bug 1774121 (no linked PR, so not sure exactly when it was introduced).  Fixed by bug 1810418, service-ca-operator e5a04d6 [4], which included library-go 3c25293 [5].

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | grep service-ca-operator
    service-ca-operator                            https://github.com/openshift/service-ca-operator                            094a9ad02dbe3bcb57d5fbad301cfcfcd48bd2ed
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.1-x86_64 | grep service-ca-operator
    service-ca-operator                            https://github.com/openshift/service-ca-operator                            094a9ad02dbe3bcb57d5fbad301cfcfcd48bd2ed
  $ git --no-pager log -2 --first-parent --oneline origin/release-4.4
  e5a04d6a (origin/release-4.4) Merge pull request #111 from marun/4.4-unique-ca-serial
  094a9ad0 Merge pull request #95 from vareti/signer-ca-metrics

  So both RCs are affected.

  Also fixed by bug 1801573, oauth-proxy 3d0621e [3], which landed before the 4.4/4.5 split.

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | grep oauth-proxy
    oauth-proxy                                    https://github.com/openshift/oauth-proxy                                    3d0621eb72c9dd1c036505363032468a9016f381
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.1-x86_64 | grep oauth-proxy
    oauth-proxy                                    https://github.com/openshift/oauth-proxy                                    3d0621eb72c9dd1c036505363032468a9016f381

  So both RCs have the OAuth proxy fix, but neither has the service-ca-operator fix.

* 4.3: Introduced by bug 1788179, service-ca-operator 8395d65 [6]. Fixed by bug 1810420, service-ca-operator dd7235b [7], which includes library-go 5844159 [8].

  Fix has not been released yet.

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.3-x86_64 | grep service-ca-operator
    service-ca-operator                           https://github.com/openshift/service-ca-operator                           774c394da334dec446703545d4baaf89611ccb9d
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.5-x86_64 | grep service-ca-operator
    service-ca-operator                           https://github.com/openshift/service-ca-operator                           8395d65888b0a4249277989f18ee03f45383e409

  So this was introduced in 4.3.5 (there was no 4.3.4).

  Fix also requires the OAuth proxy fix in bug 1809253 and [9], which is still in flight.

* 4.2: Introduced by bug 1774156, service-ca-operator 0324055 [10], which includes library-go 2cf86bb [11] and API 8ce0047 [12].  Fix in flight with bug 1810421 and [13].  [14] has already landed with library-go d58edcb.

  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.21-x86_64 | grep service-ca-operator
    service-ca-operator                           https://github.com/openshift/service-ca-operator                           f6720573b9b63147436374e51e6fda44683b1e9f
  $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.22-x86_64 | grep service-ca-operator
    service-ca-operator                           https://github.com/openshift/service-ca-operator                           0324055c3bad3a857dcf3471c024bf42c20d549e

  So this was introduced in 4.2.22.

  Fix also requires the OAuth proxy fix from bug 1809258 and [15], which is still in flight.

* 4.1: The backport that would introduce the bug (bug 1774157) is still ASSIGNED, so no 4.1 impact yet.

[1]: https://github.com/openshift/service-ca-operator/pull/110#event-3111531432
[2]: https://github.com/openshift/library-go/pull/726#event-3106684443
[3]: https://github.com/openshift/oauth-proxy/pull/152#event-3029892031
[4]: https://github.com/openshift/service-ca-operator/pull/111#event-3132963585
[5]: https://github.com/openshift/library-go/pull/728#event-3129427368
[6]: https://github.com/openshift/service-ca-operator/pull/104#event-3053794085
[7]: https://github.com/openshift/service-ca-operator/pull/112#event-3142240318
[8]: https://github.com/openshift/library-go/pull/729#event-3139571599
[9]: https://github.com/openshift/oauth-proxy/pull/160
[10]: https://github.com/openshift/service-ca-operator/pull/105#event-3076020193
[11]: https://github.com/openshift/library-go/pull/684#event-3059339775
[12]: https://github.com/openshift/api/pull/577#event-3061441773
[13]: https://github.com/openshift/service-ca-operator/pull/113
[14]: https://github.com/openshift/library-go/pull/730#event-3141931034
[15]: https://github.com/openshift/oauth-proxy/pull/164
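
The per-release checks in this comment (piping `oc adm release info --commits` through grep and comparing the commit by eye) can be sketched as a small parser. This is an illustrative helper, not part of `oc`; the function names are hypothetical and the row layout (name, repository URL, commit) is assumed from the transcripts above.

```python
def component_commits(release_info: str) -> dict:
    """Map component name -> commit from `oc adm release info --commits` text."""
    commits = {}
    for line in release_info.splitlines():
        fields = line.split()
        # Component rows look like: <name> <repo URL> <commit>
        if len(fields) == 3 and fields[1].startswith("https://"):
            commits[fields[0]] = fields[2]
    return commits


def pinned_at(release_info: str, component: str, short_sha: str) -> bool:
    """True if the release pins the component at a commit with this prefix."""
    if not short_sha:
        return False
    commit = component_commits(release_info).get(component, "")
    return commit.startswith(short_sha)
```

For example, feeding it the 4.4.0-rc.0 output shown above, `pinned_at(text, "service-ca-operator", "094a9ad0")` is true, matching the pre-fix commit in the git log. Note that this only tests for an exact pin; deciding whether a later commit contains a fix still needs the git history.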

Comment 12 Maru Newby 2020-04-17 16:05:34 UTC
Given that 4.5 is not yet released, and release versions already have a fix, I'm assuming no docs are required.

Comment 14 errata-xmlrpc 2020-07-13 17:17:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

