+++ This bug was initially created as a clone of Bug #2037272 +++

Starting with Go 1.17, support for invalid certificates is going to be removed; see https://go.dev/doc/go1.17. This means that legacy certificates that have no SAN field and rely on the CN field will no longer be accepted by Go 1.17 based TLS clients. The temporary `GODEBUG=x509ignoreCN=0` environment variable has been removed as of Go 1.17.

--- Additional comment from Sergiusz Urbaniak on 2022-01-05 11:01:43 UTC ---

Closing with CURRENTRELEASE resolution, as we can only implement preventive fixes in 4.9.
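To make the behaviour change concrete, here is a minimal sketch of hostname verification with and without the legacy CN fallback. It is a simplified model written for illustration, not Go's actual crypto/x509 code; the function names and the `cn_fallback` switch are hypothetical.

```python
from typing import List, Optional


def hostname_matches(pattern: str, host: str) -> bool:
    """Compare a certificate name with a host, label by label.

    A "*" is honoured only as the complete leftmost label.
    """
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = host.lower().rstrip(".").split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    for index, (want, got) in enumerate(zip(pattern_labels, host_labels)):
        if index == 0 and want == "*" and got:
            continue  # wildcard covers exactly one non-empty label
        if want != got:
            return False
    return True


def verify_hostname(
    host: str,
    san_dns_names: List[str],
    common_name: Optional[str],
    cn_fallback: bool,
) -> bool:
    """Return True if the certificate is acceptable for `host`.

    cn_fallback=True models the legacy behaviour (CN used when no SAN
    is present); cn_fallback=False models a client that ignores the CN.
    """
    if san_dns_names:
        return any(hostname_matches(name, host) for name in san_dns_names)
    if cn_fallback and common_name:
        return hostname_matches(common_name, host)
    return False
```

With `cn_fallback=False`, a certificate that carries only `CN=*.apps.example.com` is rejected for `oauth.apps.example.com`, while the same name listed as a SAN entry is accepted.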
As discussed, this bz needs further comprehensive verification.
Hoo, this bug is more like a complex epic! It covers many OpenShift PRs plus the OEP https://github.com/openshift/enhancements/pull/980 ! I have read through and digested them now.

Verifying it requires customizing various certs (the KAS cert, the apiserver webhook cert, the aggregated apiserver cert, component route certs such as the oauth route cert, and the certs of the various external Auth IDP servers such as OpenID servers) and testing them all. Big test work!

Detection for the apiserver webhook cert and the aggregated apiserver cert comes from upstream k8s, so we may need Apiserver QE teammates to help test apiserver_webhooks_x509_missing_san_total and apiserver_kube_aggregator_x509_missing_san_total. Sergiusz, do you think we need to test them again?

Today, after investigating the PRs and code, I only had time to test one of them. The test steps are lengthy; see the next comment.
Continuing previous comment for Auth IDP no-SAN cert:

Prepare cert without SAN:
mkdir test_cert_no_san
cd test_cert_no_san
openssl genrsa -out serverKey.pem 2048
cat > server_no_san.conf << EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
EOF
openssl req -new -key serverKey.pem -out serverNoSAN.csr -subj "/CN=*.apps.SNIPPED.qe.rhcloud.com" -config server_no_san.conf
openssl x509 -req -in serverNoSAN.csr -CA caCert.pem -CAkey caKey.pem -CAcreateserial -out serverCertNoSAN.pem -days 100000 -extensions v3_req -extfile server_no_san.conf
openssl x509 -noout -text -in serverCertNoSAN.pem # no SAN indeed
...
        Subject: CN=*.apps.SNIPPED.qe.rhcloud.com
...
        X509v3 extensions:
            X509v3 Basic Constraints:
                CA:FALSE
            X509v3 Key Usage:
                Digital Signature, Non Repudiation, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Client Authentication, TLS Web Server Authentication
...

Make the cert used in Auth OpenID server:
oc new-project keycloak
oc process -f https://raw.githubusercontent.com/keycloak/keycloak-quickstarts/latest/openshift-examples/keycloak.yaml \
    -p KEYCLOAK_USER=admin \
    -p KEYCLOAK_PASSWORD=SNIPPED \
    -p NAMESPACE=keycloak \
    | oc create -f -
oc create secret tls keycloak-x509-https --cert serverCertNoSAN.pem --key serverKey.pem
oc set volume dc/keycloak --add --name volume1 -t secret -m /etc/x509/https --secret-name keycloak-x509-https

Log in to the keycloak admin UI and create a client as in the OCP-41149 keycloak case.
echo -n 'KEYCLOAK_CLIENT_SECRET' > keycloak-oidc-secret.txt
oc create secret generic keycloak-oidc-secret --from-file=clientSecret=keycloak-oidc-secret.txt -n openshift-config
oc create configmap keycloak-oidc-ca --from-file=ca.crt=caCert.pem -n openshift-config
oc edit oauth cluster
...
  - mappingMethod: claim
    name: keycloak-oidc
    openID:
      ca:
        name: keycloak-oidc-ca
      claims:
        name:
        - name
        preferredUsername:
        - preferred_username
        - username
        - name
      clientID: myclient
      clientSecret:
        name: keycloak-oidc-secret
      extraScopes: []
      issuer: https://keycloak-keycloak.apps.SNIPPED/auth/realms/master
    type: OpenID
...

Wait for the oauth pods to be renewed, then run:
oc login -u admin -p SNIPPED # succeeded, IDP is working
oc describe co authentication
...
Status:
  Conditions:
...
    Last Transition Time:  2022-02-08T15:30:49Z
    Message:               InvalidProviderInvalidCertsUpgradeable: Server certificates without SAN detected: {provider="OpenID"}. These have to be replaced to include the respective hosts in their SAN extension and not rely on the Subject's CN for the purpose of hostname verification.
    Reason:                InvalidProviderInvalidCerts_InvalidCertsDetected
    Status:                False
    Type:                  Upgradeable

Log in to the Prometheus UI and check openshift_auth_x509_missing_san_total; there is a non-zero count for OpenID:
...
openshift_auth_x509_missing_san_total{container="oauth-openshift", endpoint="https", instance="10.129.0.48:6443", job="oauth-openshift", namespace="openshift-authentication", pod="oauth-openshift-68dd878759-kng5z", provider="LDAP", service="oauth-openshift"}
openshift_auth_x509_missing_san_total{container="oauth-openshift", endpoint="https", instance="10.129.0.48:6443", job="oauth-openshift", namespace="openshift-authentication", pod="oauth-openshift-68dd878759-kng5z", provider="OpenID", service="oauth-openshift"} 2
...

That is, the invalid Auth IDP cert is detected and Upgradeable is False, which prevents the user from upgrading to 4.10. This verifies the auth PRs.
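The Prometheus check above can also be scripted. The following is a rough sketch, assuming the samples are available as exposition-style text (`name{labels} value`); the function name `missing_san_by_provider` is hypothetical and the parser is deliberately simplified, not a full implementation of the exposition format.

```python
import re
from collections import defaultdict
from typing import Dict

# One sample per line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def missing_san_by_provider(
    text: str, metric: str = "openshift_auth_x509_missing_san_total"
) -> Dict[str, float]:
    """Sum the counter per `provider` label across all pods."""
    totals: Dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue  # other metrics, or lines without a value
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get("provider", "")] += float(match.group("value"))
    return dict(totals)


def providers_with_missing_san(text: str) -> list:
    """Providers whose summed counter is above zero, sorted by name."""
    return sorted(p for p, v in missing_san_by_provider(text).items() if v > 0)
```

Lines that carry a label set but no value are skipped rather than guessed at.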
Tested oauth-openshift route cert customization and, after confirming with Dev, filed separate bug 2052467 to track it. For the apiserver_webhooks_x509_missing_san_total and apiserver_kube_aggregator_x509_missing_san_total tests, I asked Apiserver QE to help; testing is ongoing.
Recovery tests confirmed that kube-apiserver fails to recover after the SAN-less server certificates detected in webhooks are replaced by certificates with SAN.

kube-apiserver was set to "upgradable:false" because server certificates without SAN were detected in webhooks. To recover from this situation, I replaced the certificate with one that has a SAN and waited for the status to clear on kube-apiserver. kube-apiserver fails to clear the "upgradable:false" status even after a longer wait. I also noted that the value of the metric "apiserver_webhooks_x509_missing_san_total" was not reset to zero in this situation.

Steps to reproduce the issue:

1. Pre-condition: put kube-apiserver into "upgradable:false" status by referring to certificates without SAN. Follow the steps in https://bugzilla.redhat.com/show_bug.cgi?id=2037274#c13

2. Create certificates with SAN
openssl genrsa -out caKey.pem 2048
openssl req -x509 -new -nodes -key caKey.pem -days 100000 -out caCert.pem -subj "/CN=wb_ca"
openssl genrsa -out serverKey.pem 2048
cat > server.conf << EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
IP.1 = 127.0.0.1
DNS.1 = opa.opa.svc
EOF
openssl req -new -key serverKey.pem -out serverWithSAN.csr -subj "/CN=opa.opa.svc" -config server.conf
openssl x509 -req -in serverWithSAN.csr -CA caCert.pem -CAkey caKey.pem -CAcreateserial -out serverCertWithSAN.pem -days 100000 -extensions v3_req -extfile server.conf

3. Delete existing tls secret
oc delete secrets opa-server -n opa

4. Create new secret with SAN cert
oc create secret tls opa-server --cert=serverCertWithSAN.pem --key=serverKey.pem -n opa

5. Delete the webhook and recreate it with the certificate with SAN
$ oc delete ValidatingWebhookConfiguration opa-validating-webhook
$ cat > webhook-configuration.yaml <<EOF
kind: ValidatingWebhookConfiguration
apiVersion: admissionregistration.k8s.io/v1
metadata:
  name: opa-validating-webhook
webhooks:
  - name: validating-webhook.openpolicyagent.org
    admissionReviewVersions:
      - v1beta1
    sideEffects: None
    namespaceSelector:
      matchExpressions:
      - key: openpolicyagent.org/webhook
        operator: In
        values:
        - ignore
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: ["*"]
        apiVersions: ["*"]
        resources: ["*"]
        scope: 'Namespaced'
    clientConfig:
      caBundle: $(cat serverCertWithSAN.pem | base64 | tr -d '\n')
      service:
        namespace: opa
        name: opa
EOF
$ oc apply -f webhook-configuration.yaml

6. Restart the opa pods to pick up the new secret.
$ oc delete replicaset <replica_name> -n opa

7. Describe kube-apiserver to see the status.
$ oc describe co/kube-apiserver
Name:         kube-apiserver
........
    Last Transition Time:  2022-02-11T06:58:15Z
    Message:               InvalidCertsUpgradeable: Server certificates without SAN detected: {type="webhooks"}. These have to be replaced to include the respective hosts in their SAN extension and not rely on the Subject's CN for the purpose of hostname verification.
    Reason:                InvalidCerts_InvalidCertsDetected
    Status:                False
    Type:                  Upgradeable
  Extension:               <nil>

Check the Prometheus UI for the apiserver_webhooks_x509_missing_san_total metric value.
apiserver_webhooks_x509_missing_san_total{apiserver="kube-apiserver", endpoint="https", instance="10.0.0.6:6443", job="apiserver", namespace="default", service="kubernetes"} 9
apiserver_webhooks_x509_missing_san_total{apiserver="kube-apiserver", endpoint="https", instance="10.0.0.7:6443", job="apiserver", namespace="default", service="kubernetes"} 4
apiserver_webhooks_x509_missing_san_total{apiserver="kube-apiserver", endpoint="https", instance="10.0.0.8:6443", job="apiserver", namespace="default", service="kubernetes"} 33

As shown above, kube-apiserver failed to recover from the current status (i.e. upgradable:false due to certificates without SAN in a webhook, even after they were replaced by certificates with SAN). Also note that in the Prometheus UI the apiserver_webhooks_x509_missing_san_total count was not reset to zero.
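The SAN and no-SAN openssl configs used in this bug differ only in the subjectAltName section. A small sketch that generates either variant follows; `build_req_config` is a hypothetical helper, and the fixed lines are copied from the server.conf shown in the steps above.

```python
from typing import Sequence


def build_req_config(
    dns_names: Sequence[str] = (), ip_addresses: Sequence[str] = ()
) -> str:
    """Render an openssl req config; SAN entries are added only if given."""
    lines = [
        "[req]",
        "req_extensions = v3_req",
        "distinguished_name = req_distinguished_name",
        "[req_distinguished_name]",
        "[ v3_req ]",
        "basicConstraints = CA:FALSE",
        "keyUsage = nonRepudiation, digitalSignature, keyEncipherment",
        "extendedKeyUsage = clientAuth, serverAuth",
    ]
    if dns_names or ip_addresses:
        # Reference a separate section that lists every alternative name.
        lines.append("subjectAltName = @alt_names")
        lines.append("[alt_names]")
        lines += [f"IP.{i} = {ip}" for i, ip in enumerate(ip_addresses, start=1)]
        lines += [f"DNS.{i} = {name}" for i, name in enumerate(dns_names, start=1)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Same content as the server.conf used for the opa webhook cert.
    print(build_req_config(dns_names=["opa.opa.svc"], ip_addresses=["127.0.0.1"]))
```

Calling it with no arguments yields the no-SAN variant, which makes it easy to produce both certificates from one script when switching between the failing and the recovered state.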
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.22    True        False         13h     Cluster version is 4.9.22

Steps see below.

1. Create certificates with no SAN:
$ git clone git:wangke19/sample-apiserver.git # Some changes adapt to current testing on OpenShift
$ cd sample-apiserver
$ mkdir certs
$ cd certs
$ openssl genrsa -out caKey.pem 2048
$ openssl req -x509 -new -nodes -key caKey.pem -days 100000 -out caCert.pem -subj "/CN=system:openshift-aggregator"
$ openssl genrsa -out serverKey.pem 2048
$ cat > server_no_san.conf << EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
EOF
$ openssl req -new -key serverKey.pem -out serverNoSAN.csr -subj "/CN=localhost" -config server_no_san.conf
$ openssl x509 -req -in serverNoSAN.csr -CA caCert.pem -CAkey caKey.pem -CAcreateserial -out serverCertNoSAN.pem -days 100000 -extensions v3_req -extfile server_no_san.conf
$ openssl x509 -noout -text -in serverCertNoSAN.pem
Certificate:
...
        Issuer: CN = system:openshift-aggregator
...
        Subject: CN = localhost
...
        X509v3 extensions:
            X509v3 Basic Constraints:
                CA:FALSE
            X509v3 Key Usage:
                Digital Signature, Non Repudiation, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Client Authentication, TLS Web Server Authentication

2. Deploy the sample-apiserver to the cluster.
$ oc create secret tls aggserver --cert=serverCertNoSAN.pem --key=serverKey.pem
$ cd sample-apiserver/artifacts/example

Specify the tls cert and key for the aggregator sample apiserver:
$ cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wardle-server
  namespace: default
  labels:
    apiserver: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      apiserver: "true"
  template:
    metadata:
      labels:
        apiserver: "true"
    spec:
      serviceAccountName: apiserver
      containers:
      - name: wardle-server
        # build from staging/src/k8s.io/sample-apiserver/artifacts/simple-image/Dockerfile
        # or
        # docker pull k8s.gcr.io/e2e-test-images/sample-apiserver:1.17.4
        # docker tag k8s.gcr.io/e2e-test-images/sample-apiserver:1.17.4 kube-sample-apiserver:latest
        image: quay.io/wangke19/kube-sample-apiserver
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - readOnly: true
          mountPath: /certs
          name: apiserver
        args:
        - "--etcd-servers=http://localhost:2379"
        - "--tls-cert-file=/certs/tls.crt"
        - "--tls-private-key-file=/certs/tls.key"
      - name: etcd
        image: quay.io/coreos/etcd:v3.5.1
      volumes:
      - name: apiserver
        secret:
          secretName: aggserver

$ cat apiservice.yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.wardle.example.com
spec:
  insecureSkipTLSVerify: true
  group: wardle.example.com
  groupPriorityMinimum: 1000
  versionPriority: 15
  service:
    name: api
    namespace: default
  clientConfig:
    caBundle: $(cat ../../certs/serverCertNoSAN.pem | base64 | tr -d '\n')
  version: v1alpha1

$ oc apply -f artifacts/example
$ oc get apiservice/v1alpha1.wardle.example.com
NAME                          SERVICE       AVAILABLE   AGE
v1alpha1.wardle.example.com   default/api   True        62s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. Verification of the apiserver_kube_aggregator_x509_missing_san metrics:
$ oc describe co kube-apiserver | grep -B5 'Extension:'
    Last Transition Time:  2022-03-01T02:38:25Z
    Message:               InvalidCertsUpgradeable: Server certificates without SAN detected: {type="aggregation"}. These have to be replaced to include the respective hosts in their SAN extension and not rely on the Subject's CN for the purpose of hostname verification.
    Reason:                InvalidCerts_InvalidCertsDetected
    Status:                False
    Type:                  Upgradeable
  Extension:               <nil>

token=`oc sa get-token cluster-monitoring-operator -n openshift-monitoring`
pod_name=$(oc get pods -n openshift-kube-apiserver | grep -i 'running' | head -1 | cut -d " " -f1)
op_endp=$(oc get ep -n openshift-kube-apiserver | grep 6443 | awk '{print $2}'| cut -d "," -f1)
$ oc -n openshift-kube-apiserver exec -it $pod_name -- curl -k -H "Authorization: Bearer $token" https://$op_endp/metrics | grep aggregator_x509_missing_san_total
...
# HELP apiserver_kube_aggregator_x509_missing_san_total [ALPHA] Counts the number of requests to servers missing SAN extension in their serving certificate OR the number of connection failures due to the lack of x509 certificate SAN extension missing (either/or, based on the runtime environment)
# TYPE apiserver_kube_aggregator_x509_missing_san_total counter
apiserver_kube_aggregator_x509_missing_san_total 7

From the above, the sample aggregated apiserver's cert without the SAN extension is detected and Upgradeable is False, which prevents the user from upgrading to 4.10. That is as expected.
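The manual inspection of `openssl x509 -noout -text` output can be automated. Below is a minimal sketch that extracts the SAN entries from that text; it assumes the usual layout where the entries follow on the line after the "X509v3 Subject Alternative Name" header, and `subject_alt_names` is a hypothetical helper name.

```python
from typing import List

SAN_HEADER = "X509v3 Subject Alternative Name"


def subject_alt_names(openssl_text: str) -> List[str]:
    """Return SAN entries such as 'DNS:localhost'; empty list if none."""
    lines = openssl_text.splitlines()
    for index, line in enumerate(lines):
        if SAN_HEADER in line:
            if index + 1 >= len(lines):
                return []  # header present but output truncated
            entries = lines[index + 1].split(",")
            return [entry.strip() for entry in entries if entry.strip()]
    return []


def has_san(openssl_text: str) -> bool:
    """True if the certificate text lists at least one SAN entry."""
    return bool(subject_alt_names(openssl_text))
```

A certificate for which `has_san` returns False is the kind that the missing-SAN metrics in this bug are meant to flag.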
For the apiserver_kube_aggregator_x509_missing_san_total test, I retested comment 15 with the latest 4.9 payload and got the same results, as expected.

Next, verify the cert with SAN for the aggregator apiserver:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-03-15-215745   True        False         7h2m    Cluster version is 4.9.0-0.nightly-2022-03-15-215745

- Update the cert to use SAN and make the component pod reload the new cert
$ cd sample-apiserver
$ mkdir san_certs
$ cd san_certs
$ openssl genrsa -out caKey.pem 2048
$ openssl req -x509 -new -nodes -key caKey.pem -days 100000 -out caCert.pem -subj "/CN=system:openshift-aggregator"
$ openssl genrsa -out serverKey.pem 2048
$ cat > server.conf << EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
IP.1 = 127.0.0.1
DNS.1 = localhost
EOF
$ openssl req -new -key serverKey.pem -out serverSAN.csr -subj "/CN=localhost" -config server.conf
$ openssl x509 -req -in serverSAN.csr -CA caCert.pem -CAkey caKey.pem -CAcreateserial -out serverCertSAN.pem -days 100000 -extensions v3_req -extfile server.conf
$ openssl x509 -noout -text -in serverCertSAN.pem
Certificate:
...
        Issuer: CN = system:openshift-aggregator
...
        Subject: CN = localhost
...
            X509v3 Subject Alternative Name:
                IP Address:127.0.0.1, DNS:localhost
$ oc create secret tls aggserver --cert=serverCertSAN.pem --key=serverKey.pem
secret/aggserver created

Edit apiservice.yaml, change caBundle to the cert with SAN, and save.
$ cd artifacts/example
$ vi apiservice.yaml
...
  clientConfig:
    caBundle: $(cat ../../san_certs/serverCertSAN.pem | base64 | tr -d '\n')
$ cd ../../
$ oc apply -f artifacts/example
apiservice.apiregistration.k8s.io/v1alpha1.wardle.example.com created
clusterrolebinding.rbac.authorization.k8s.io/wardle:system:auth-delegator created
rolebinding.rbac.authorization.k8s.io/wardle-auth-reader created
deployment.apps/wardle-server created
clusterrolebinding.rbac.authorization.k8s.io/sample-apiserver-clusterrolebinding created
clusterrole.rbac.authorization.k8s.io/aggregated-apiserver-clusterrole created
serviceaccount/apiserver created
service/api created
$ oc get apiservice/v1alpha1.wardle.example.com
NAME                          SERVICE       AVAILABLE   AGE
v1alpha1.wardle.example.com   default/api   True        48s

- Describe kube-apiserver to see the status:
$ oc describe co kube-apiserver | grep -B5 'Extension:'
    Last Transition Time:  2022-03-16T07:56:29Z
    Message:               InvalidCertsUpgradeable: Server certificates without SAN detected: {type="aggregation"}. These have to be replaced to include the respective hosts in their SAN extension and not rely on the Subject's CN for the purpose of hostname verification.
    Reason:                InvalidCerts_InvalidCertsDetected
    Status:                False
    Type:                  Upgradeable
  Extension:               <nil>

After a few minutes, Upgradeable should change from False to True:
$ oc describe co kube-apiserver | grep -B5 'Extension:'
    Last Transition Time:  2022-03-16T09:24:29Z
    Message:               KubeletMinorVersionUpgradeable: Kubelet and API server minor versions are synced.
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>

- Check the Prometheus UI for the apiserver_kube_aggregator_x509_missing_san_total metric value, wait more than 5m, and check it again; the count stays the same without any changes.
apiserver_kube_aggregator_x509_missing_san_total kube-apiserver https 10.0.155.45:6443 apiserver default openshift-monitoring/k8s kubernetes 72
apiserver_kube_aggregator_x509_missing_san_total kube-apiserver https 10.0.171.167:6443 apiserver default openshift-monitoring/k8s kubernetes 111
apiserver_kube_aggregator_x509_missing_san_total kube-apiserver https 10.0.194.169:6443 apiserver default openshift-monitoring/k8s kubernetes 533

Based on the above, kube-apiserver handles the aggregator apiserver as expected both with and without a SAN in its cert.
Verified openshift_auth_x509_missing_san_total in 4.9.0-0.nightly-2022-03-15-215745. To keep this bug from getting even more crowded and lengthy, the verification steps are in the "Verification for openshift_auth_x509_missing_san_total" section of the doc https://docs.google.com/document/d/1E0s2jp_zpeJGPOZ-iSBe1sYFvK1C659Ut-X0y7JpQ6c/edit#heading=h.aipr2g1orltn

Conclusion: when a SAN-less cert is used in the IDP server, the query
  sum by (provider) (increase(openshift_auth_x509_missing_san_total{job="oauth-openshift",namespace="openshift-authentication"}[30m])) or on() vector(0)
is positive and sets Upgradeable to False; once the IDP server is updated with a SAN cert and 30 minutes have passed, the query drops to zero and Upgradeable turns True again.

Note: the fix has a flaw: even if we do nothing but wait 30 minutes, the operator's Upgradeable can still turn True. In theory, this flaw cannot be fixed.
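The flaw noted above follows from how a windowed increase over a counter behaves: the counter itself never goes back to zero, but its growth within the last 30 minutes does once nothing increments it. The sketch below is a simplified model of that idea only. It does not reproduce the extrapolation Prometheus applies in `increase()`, and the `upgradeable` function is an assumed stand-in for the operator's condition, not its actual code.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix seconds, counter value)


def increase(samples: List[Sample], now: float, window: float) -> float:
    """Counter growth inside [now - window, now], tolerating resets."""
    in_window = [s for s in samples if now - window <= s[0] <= now]
    total = 0.0
    for (_, previous), (_, current) in zip(in_window, in_window[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted from zero
    return total


def upgradeable(samples: List[Sample], now: float, window: float = 30 * 60) -> bool:
    """Assumed rule: Upgradeable is True when the window shows no growth."""
    return increase(samples, now, window) == 0


if __name__ == "__main__":
    # Counter grows from 70 to 72, then no further requests are counted.
    history = [(0, 70), (600, 72), (1200, 72), (1800, 72), (2400, 72), (3000, 72)]
    print(upgradeable(history, now=1200))  # growth is still inside the window
    print(upgradeable(history, now=3000))  # window holds only flat samples
```

In this model the second call reports the cluster as upgradeable although the raw counter still reads 72, whether the certificate was replaced or the IDP simply received no requests during the window.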
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.25 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0861
*** Bug 2054256 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days