Description of problem:

A 4.3.3 cluster with FIPS enabled was upgraded to 4.4.0-0.nightly-2020-03-04-000622. After the upgrade, curl against an https endpoint reports the error "You are attempting to import a cert with the same issuer/serial as an existing cert, but that is not the same cert". The http endpoint does not have this issue since it does not need a cert.

NOTE: This issue does not appear to affect cluster functionality.

*********************************************
# oc -n openshift-apiserver get pod -o wide
NAME                       READY   STATUS    RESTARTS   AGE    IP            NODE                                       NOMINATED NODE   READINESS GATES
apiserver-88b6f947-7j4dz   1/1     Running   0          116m   10.128.0.8    qe-jia-nb8tg-m-1.c.openshift-qe.internal   <none>           <none>
apiserver-88b6f947-lp624   1/1     Running   0          116m   10.130.0.41   qe-jia-nb8tg-m-2.c.openshift-qe.internal   <none>           <none>
apiserver-88b6f947-z7wg6   1/1     Running   0          117m   10.129.0.29   qe-jia-nb8tg-m-0.c.openshift-qe.internal   <none>           <none>

# oc -n openshift-apiserver get ep
NAME   ENDPOINTS                                            AGE
api    10.128.0.8:8443,10.129.0.29:8443,10.130.0.41:8443    4h34m

# oc -n openshift-apiserver exec apiserver-88b6f947-7j4dz -- curl -k 'https://10.128.0.8:8443/metrics'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (35) You are attempting to import a cert with the same issuer/serial as an existing cert, but that is not the same cert.
FYI, the normal result on a FIPS-disabled cluster should look like the following, since the endpoint needs an SA token:
*********************************
# oc -n openshift-apiserver exec apiserver-5cfd58bdbd-m86n6 -- curl -k 'https://10.128.0.39:8443/metics'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   239  100   239    0     0   1531      0 --:--:-- --:--:-- --:--:--     0
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metics\"",
  "reason": "Forbidden",
  "details": {
  },
  "code": 403
}
*********************************

On the same FIPS-enabled cluster, http does not have the issue:

# oc -n openshift-cluster-version get pod -o wide
NAME                                        READY   STATUS    RESTARTS   AGE    IP         NODE                                       NOMINATED NODE   READINESS GATES
cluster-version-operator-77fbd67d45-fzcts   1/1     Running   0          119m   10.0.0.6   qe-jia-nb8tg-m-2.c.openshift-qe.internal   <none>           <none>

# oc -n openshift-cluster-version get ep
NAME                       ENDPOINTS       AGE
cluster-version-operator   10.0.0.6:9099   4h48m

# oc -n openshift-cluster-version exec cluster-version-operator-77fbd67d45-fzcts -- curl 'http://10.0.0.6:9099/metrics'
...
# TYPE cluster_installer gauge
cluster_installer{invoker="user",type="openshift-install",version="v4.3.3"} 1
# HELP cluster_operator_condition_transitions Reports the number of times that a condition on a cluster operator changes status
# TYPE cluster_operator_condition_transitions gauge
cluster_operator_condition_transitions{condition="Available",name="console"} 1
cluster_operator_condition_transitions{condition="Available",name="ingress"} 2
*********************************************

Version-Release number of selected component (if applicable):
4.3.3 cluster with FIPS enabled, upgraded to 4.4.0-0.nightly-2020-03-04-000622

How reproducible:
always

Steps to Reproduce:
1. See the description
2.
3.

Actual results:

Expected results:

Additional info:
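For context on the curl error: NSS-based clients index certificates by their (issuer, serial) pair and refuse to import a *different* certificate that reuses an existing pair. A minimal sketch of that uniqueness check, with hypothetical issuer/serial values (illustrative only, not NSS's actual code):

```python
def import_chain(certs):
    """certs: list of (issuer, serial, der_bytes) tuples, as a client
    like curl would see them while walking a served certificate chain."""
    seen = {}
    for issuer, serial, der in certs:
        key = (issuer, serial)
        if key in seen and seen[key] != der:
            # This is the condition behind curl error (35) in the report:
            # same issuer/serial, different certificate bytes.
            raise ValueError(
                "attempting to import a cert with the same issuer/serial "
                "as an existing cert, but that is not the same cert"
            )
        seen[key] = der
    return len(seen)

# Hypothetical pre- and post-rotation service CA certs: the rotated CA
# kept the old issuer/serial pair but has different bytes.
old_ca = ("CN=example-service-signer", 1, b"old-der-bytes")
new_ca = ("CN=example-service-signer", 1, b"new-der-bytes")
```

Go's crypto/x509 does not key its cert pools this way, which matches the observation below that only non-golang clients hit the error.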
Created attachment 1667540 [details] CA cert chain
Confirmed: this issue has nothing to do with FIPS; it is a regression introduced by the upgrade.
Bumping to urgent. This needs to be backported as far back as 4.2 ASAP to avoid potentially impacting anyone upgrading to a rotation-supporting z-stream release. On the bright side, this problem doesn't appear in golang code, so only non-golang clients are likely to be impacted.
Copying the assessment from the downstream 4.3 bug 1810420, since this is the bug that's referenced in the blocked edges.

Who is impacted?
All customers upgrading to 4.3.5 that run workloads which use service-ca for certs. All customers that install a fresh 4.3.5 cluster will be affected on rotation after 13 months if they don't upgrade.

What is the impact?
All workloads that use non-golang SSL network clients which use service-ca to communicate with the platform or between each other (e.g. curl).

How involved is remediation?
Manual rotation of service-ca will fix the cluster until the next rotation (13 months).

Is this a regression?
Yes, this was introduced as part of the automated service-ca rotation and released on March 10 in 4.3.5.

> Can you please add the step for manual workaround. It would be useful for CEE

https://docs.openshift.com/container-platform/4.3/authentication/certificates/service-serving-certificate.html#manually-rotate-service-ca_service-serving-certificate
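For CEE reference, the linked procedure boils down to deleting the service CA signing key (the operator then regenerates it) and restarting pods so they pick up certs from the new CA. A hedged sketch with a dry-run default so it cannot accidentally disrupt a cluster; the namespace loop is one possible shape of the restart step, not the exact docs text:

```shell
#!/bin/sh
# Sketch of the documented manual service-ca rotation.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0
# on a real cluster where you accept the disruption.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Delete the signing key; the service-ca operator recreates the CA.
run oc delete secret/signing-key -n openshift-service-ca

# 2. Restart workloads so they pick up certs signed by the new CA.
#    WARNING: this attempts to delete every pod in the cluster,
#    control plane included.
if [ "$DRY_RUN" = "1" ]; then
    echo "+ for each namespace: oc delete pods --all -n <namespace>"
else
    for ns in $(oc get namespaces -o jsonpath='{.items[*].metadata.name}'); do
        oc delete pods --all -n "$ns"
    done
fi
```

As noted further down in this bug, the mass pod deletion is disruptive but the cluster recovers, similar to the node drains during an upgrade.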
Is this only an issue if you have FIPS enabled?

Can FIPS only be enabled if you install with 4.3 and enable it at install time?

So if you have FIPS enabled, the workaround is to rotate manually following these steps?

https://docs.openshift.com/container-platform/4.3/authentication/certificates/service-serving-certificate.html#manually-rotate-service-ca_service-serving-certificate

Do we really want to point out a command that will delete every pod in the cluster... (well, try to; it will likely fail, causing the cluster to fall over for the time being)
(In reply to Ryan Howe from comment #9)
> Is this only an issue if you have FIPs enabled
> Can FIPS only be enabled if you install with 4.3 and enable it at install
> time?

This issue is not limited to a cluster with FIPS enabled. It affects any cluster that is upgraded to a release that enables automated service CA rotation without also ensuring unique CA serial numbers.

> So if you have fips enabled the workaround is to rotate manually following
> these steps?:
>
> https://docs.openshift.com/container-platform/4.3/authentication/
> certificates/service-serving-certificate.html#manually-rotate-service-
> ca_service-serving-certificate
>
> Do we really want to point out a command that will delete every pod in the
> cluster... (well try, it will fail likely causing the cluster to fall over
> for a time being)

The provided link is the documented procedure for manual CA rotation. Manual deletion of all pods will disrupt all services in the cluster - including the control plane - but the cluster will recover. It's similar to the node drain that occurs on every upgrade.
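The defect and its fix can be sketched as a serial-assignment question: the broken rotation reissued the CA under an already-used issuer/serial pair, while the fix guarantees a fresh serial per rotation. The logic below is hypothetical, for illustration only, and is not the actual library-go implementation:

```python
import secrets

ISSUER = "CN=example-service-signer"  # hypothetical issuer name

def rotate_reusing_serial(previous_serial):
    """Broken scheme: the rotated CA reuses the previous serial, so its
    (issuer, serial) pair collides with the still-trusted old CA."""
    return (ISSUER, previous_serial)

def rotate_unique_serial(previous_serial):
    """Fixed scheme: a fresh random serial per rotation. RFC 5280 allows
    serials up to 20 octets, so random serials make collisions negligible."""
    return (ISSUER, secrets.randbits(64))
```

Either way the issuer stays the same across rotations; only serial uniqueness lets clients that key certificates on (issuer, serial), such as NSS-based curl, distinguish the old and new CA certs.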
Expanding on the "Who is impacted?" from comment 8, so we know which update recommendations to pull and which releases to tombstone:

The bugs were introduced by the bug 1774121 series and fixed by the combination of this series and bug 1801573. Quick overview:

* 4.4: both rc.0 and rc.1 are affected, so updates into rc.0 are impacted and rc.1 should be tombstoned (and running either RC for 13+ months will also hit a broken CA rotation). Fixes have landed, so the next 4.4 RC should be clean.
* 4.3: 4.3.5 introduced the breakage, so updates into 4.3.5 are impacted. No fix yet.
* 4.2: 4.2.22 introduced the breakage, so updates into 4.2.22, 4.2.23, and 4.2.24 are impacted. No fix yet.
* 4.1: not impacted yet. Bug 1774157 was backporting the breaking change, and is still ASSIGNED.

Reasoning behind the overview's claims:

* 4.5: Introduced by bug 1774121 (no linked PR, so not sure exactly when it was introduced). Fixed by bug 1810036, service-ca-operator 74b5ce2 [1], which included library-go d9c73bb [2]. Also fixed by bug 1801573, oauth-proxy 3d0621e [3], which landed before the 4.4/4.5 split.

* 4.4: Introduced by bug 1774121 (no linked PR, so not sure exactly when it was introduced). Fixed by bug 1810418, service-ca-operator e5a04d6 [4], which included library-go 3c25293 [5].

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | grep service-ca-operator
  service-ca-operator  https://github.com/openshift/service-ca-operator  094a9ad02dbe3bcb57d5fbad301cfcfcd48bd2ed
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.1-x86_64 | grep service-ca-operator
  service-ca-operator  https://github.com/openshift/service-ca-operator  094a9ad02dbe3bcb57d5fbad301cfcfcd48bd2ed
$ git --no-pager log -2 --first-parent --oneline origin/release-4.4
e5a04d6a (origin/release-4.4) Merge pull request #111 from marun/4.4-unique-ca-serial
094a9ad0 Merge pull request #95 from vareti/signer-ca-metrics

So both RCs are affected.
Also fixed by bug 1801573, oauth-proxy 3d0621e [3], which landed before the 4.4/4.5 split.

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | grep oauth-proxy
  oauth-proxy  https://github.com/openshift/oauth-proxy  3d0621eb72c9dd1c036505363032468a9016f381
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.1-x86_64 | grep oauth-proxy
  oauth-proxy  https://github.com/openshift/oauth-proxy  3d0621eb72c9dd1c036505363032468a9016f381

So both RCs have the OAuth fix, but neither has the service-ca-operator fix.

* 4.3: Introduced by bug 1788179, service-ca-operator 8395d65 [6]. Fixed by bug 1810420, service-ca-operator dd7235b [7], which includes library-go 5844159 [8]. The fix has not been released yet.

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.3-x86_64 | grep service-ca-operator
  service-ca-operator  https://github.com/openshift/service-ca-operator  774c394da334dec446703545d4baaf89611ccb9d
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.5-x86_64 | grep service-ca-operator
  service-ca-operator  https://github.com/openshift/service-ca-operator  8395d65888b0a4249277989f18ee03f45383e409

So this was introduced in 4.3.5 (there was no 4.3.4). The fix also requires the OAuth proxy fix in bug 1809253 and [9], which is still in flight.

* 4.2: Introduced by bug 1774156, service-ca-operator 0324055 [10], which includes library-go 2cf86bb [11] and API 8ce0047 [12]. Fix in flight with bug 1810421 and [13]. [14] has already landed with library-go d58edcb.
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.21-x86_64 | grep service-ca-operator
  service-ca-operator  https://github.com/openshift/service-ca-operator  f6720573b9b63147436374e51e6fda44683b1e9f
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.2.22-x86_64 | grep service-ca-operator
  service-ca-operator  https://github.com/openshift/service-ca-operator  0324055c3bad3a857dcf3471c024bf42c20d549e

So this was introduced in 4.2.22. The fix also requires the OAuth proxy fix from bug 1809258 and [15], which is still in flight.

* 4.1: The backport of the breaking change, bug 1774157, is still ASSIGNED, so there is no 4.1 impact yet.

[1]: https://github.com/openshift/service-ca-operator/pull/110#event-3111531432
[2]: https://github.com/openshift/library-go/pull/726#event-3106684443
[3]: https://github.com/openshift/oauth-proxy/pull/152#event-3029892031
[4]: https://github.com/openshift/service-ca-operator/pull/111#event-3132963585
[5]: https://github.com/openshift/library-go/pull/728#event-3129427368
[6]: https://github.com/openshift/service-ca-operator/pull/104#event-3053794085
[7]: https://github.com/openshift/service-ca-operator/pull/112#event-3142240318
[8]: https://github.com/openshift/library-go/pull/729#event-3139571599
[9]: https://github.com/openshift/oauth-proxy/pull/160
[10]: https://github.com/openshift/service-ca-operator/pull/105#event-3076020193
[11]: https://github.com/openshift/library-go/pull/684#event-3059339775
[12]: https://github.com/openshift/api/pull/577#event-3061441773
[13]: https://github.com/openshift/service-ca-operator/pull/113
[14]: https://github.com/openshift/library-go/pull/730#event-3141931034
[15]: https://github.com/openshift/oauth-proxy/pull/164
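The per-release assessment above can be condensed into a small lookup table. A hypothetical helper for quick triage; the version sets are a snapshot copied from the overview in these comments, and later z-streams may carry the fix:

```python
# Releases whose service-ca rotation is known-broken, per the
# assessment above (4.1 not yet impacted; 4.5 fix already landed).
AFFECTED = {
    "4.4": {"4.4.0-rc.0", "4.4.0-rc.1"},
    "4.3": {"4.3.5"},
    "4.2": {"4.2.22", "4.2.23", "4.2.24"},
}

def rotation_is_broken(version: str) -> bool:
    """Return True if the given release is in the known-broken set."""
    minor = ".".join(version.split(".")[:2])
    return version in AFFECTED.get(minor, set())
```

This mirrors the manual `oc adm release info --commits ... | grep service-ca-operator` checks above, which remain the authoritative way to verify a specific release payload.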
Given that 4.5 is not yet released, and released versions already have a fix, I'm assuming no docs are required.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409