Bug 2037274
| Summary: | legacy certificates missing SAN entries render the cluster dysfunctional | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sergiusz Urbaniak <surbania> |
| Component: | apiserver-auth | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 4.9 | CC: | aos-bugs, jmekkatt, kewang, mfojtik, rgangwar, rugouvei, slaznick, surbania, wking, xxia, ytripath |
| Target Milestone: | --- | Flags: | ytripath: needinfo- |
| Target Release: | 4.9.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 2037272 | Environment: | |
| Last Closed: | 2022-03-21 12:30:12 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2031839 | ||
| Bug Blocks: | |||
Description
Sergiusz Urbaniak
2022-01-05 11:03:01 UTC
As discussed, this bz needs further comprehensive verification.

Hoo, this bug is more like a complex epic! It contains many OpenShift PRs and the OEP https://github.com/openshift/enhancements/pull/980. I have read through and digested them. Verifying it requires customizing various certs (the KAS cert, the apiserver webhook cert, the aggregated apiserver cert, component route certs such as the oauth route cert, and the certs of various external Auth IDP servers such as OpenID servers) and testing them all. Big test work! The apiserver webhook cert and aggregated cert detection comes from upstream k8s, so Apiserver QE teammates may need to help test apiserver_webhooks_x509_missing_san_total and apiserver_kube_aggregator_x509_missing_san_total. Sergiusz, do you think we need to test them again? Today, after investigating the PRs and code, I only had time to test one of them. The test steps are lengthy; see the next comment.

Continuing the previous comment, for the Auth IDP no-SAN cert:
Prepare cert without SAN:
mkdir test_cert_no_san
cd test_cert_no_san
openssl genrsa -out serverKey.pem 2048
cat > server_no_san.conf << EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
EOF
openssl req -new -key serverKey.pem -out serverNoSAN.csr -subj "/CN=*.apps.SNIPPED.qe.rhcloud.com" -config server_no_san.conf
openssl x509 -req -in serverNoSAN.csr -CA caCert.pem -CAkey caKey.pem -CAcreateserial -out serverCertNoSAN.pem -days 100000 -extensions v3_req -extfile server_no_san.conf
openssl x509 -noout -text -in serverCertNoSAN.pem # no SAN indeed
...
Subject: CN=*.apps.SNIPPED.qe.rhcloud.com
...
X509v3 extensions:
X509v3 Basic Constraints:
CA:FALSE
X509v3 Key Usage:
Digital Signature, Non Repudiation, Key Encipherment
X509v3 Extended Key Usage:
TLS Web Client Authentication, TLS Web Server Authentication
...
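The manual check above (reading the `openssl x509 -noout -text` dump and confirming that no SAN is listed) could be scripted. A minimal sketch, assuming `openssl` is on the PATH; the `has_san` helper name is hypothetical and is not part of any PR referenced in this bug:

```shell
#!/bin/sh
# has_san (hypothetical helper): print "yes" if the PEM certificate in $1
# lists an X509v3 Subject Alternative Name extension, "no" otherwise.
# A file that openssl cannot parse also yields "no".
has_san() {
    if openssl x509 -noout -text -in "$1" 2>/dev/null \
        | grep -q 'X509v3 Subject Alternative Name'; then
        echo yes
    else
        echo no
    fi
}
```

For the certificate generated in this comment, `has_san serverCertNoSAN.pem` would be expected to report that no SAN is present.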
Use the cert in the Auth OpenID server:
oc new-project keycloak
oc process -f https://raw.githubusercontent.com/keycloak/keycloak-quickstarts/latest/openshift-examples/keycloak.yaml \
-p KEYCLOAK_USER=admin \
-p KEYCLOAK_PASSWORD=SNIPPED \
-p NAMESPACE=keycloak \
| oc create -f -
oc create secret tls keycloak-x509-https --cert serverCertNoSAN.pem --key serverKey.pem
oc set volume dc/keycloak --add --name volume1 -t secret -m /etc/x509/https --secret-name keycloak-x509-https
Log in to the Keycloak admin UI and create a client as in the OCP-41149 keycloak case.
echo -n 'KEYCLOAK_CLIENT_SECRET' > keycloak-oidc-secret.txt
oc create secret generic keycloak-oidc-secret --from-file=clientSecret=keycloak-oidc-secret.txt -n openshift-config
oc create configmap keycloak-oidc-ca --from-file=ca.crt=caCert.pem -n openshift-config
oc edit oauth cluster
...
  - mappingMethod: claim
    name: keycloak-oidc
    openID:
      ca:
        name: keycloak-oidc-ca
      claims:
        name:
        - name
        preferredUsername:
        - preferred_username
        - username
        - name
      clientID: myclient
      clientSecret:
        name: keycloak-oidc-secret
      extraScopes: []
      issuer: https://keycloak-keycloak.apps.SNIPPED/auth/realms/master
    type: OpenID
...
Wait for the oauth pods to be renewed, then run:
oc login -u admin -p SNIPPED # succeeded, IDP is working
oc describe co authentication
...
Status:
Conditions:
...
Last Transition Time: 2022-02-08T15:30:49Z
Message: InvalidProviderInvalidCertsUpgradeable: Server certificates without SAN detected: {provider="OpenID"}. These have to be replaced to include the respective hosts in their SAN extension and not rely on the Subject's CN for the purpose of hostname verification.
Reason: InvalidProviderInvalidCerts_InvalidCertsDetected
Status: False
Type: Upgradeable
Log in to the Prometheus UI and check openshift_auth_x509_missing_san_total; there is a non-zero count for OpenID:
...
openshift_auth_x509_missing_san_total{container="oauth-openshift", endpoint="https", instance="10.129.0.48:6443", job="oauth-openshift", namespace="openshift-authentication", pod="oauth-openshift-68dd878759-kng5z", provider="LDAP", service="oauth-openshift"}
openshift_auth_x509_missing_san_total{container="oauth-openshift", endpoint="https", instance="10.129.0.48:6443", job="oauth-openshift", namespace="openshift-authentication", pod="oauth-openshift-68dd878759-kng5z", provider="OpenID", service="oauth-openshift"} 2
...
That is, the invalid Auth IDP cert is detected and Upgradeable is False, which prevents the user from upgrading to 4.10. This verifies the auth PRs.
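The per-provider check done by eye in the Prometheus UI could also be done on saved metric text. A minimal sketch, assuming one series per line in the same shape as the samples above; the function name `sum_missing_san_by_provider` is hypothetical, and series that carry no sample value (like the LDAP line above) are skipped:

```shell
#!/bin/sh
# sum_missing_san_by_provider (hypothetical helper): read metric lines on
# stdin and print "<provider> <total>" for openshift_auth_x509_missing_san_total.
sum_missing_san_by_provider() {
    awk '
        /^openshift_auth_x509_missing_san_total[{]/ {
            # Skip series whose last field is not a numeric sample value.
            if ($NF !~ /^[0-9.eE+-]+$/) next
            if (match($0, /provider="[^"]*"/)) {
                # Drop the 10-character prefix provider=" and the closing quote.
                p = substr($0, RSTART + 10, RLENGTH - 11)
                sum[p] += $NF
            }
        }
        END { for (p in sum) printf "%s %d\n", p, sum[p] }
    ' | sort
}
```

Any provider printed with a non-zero total is one whose server presented a certificate without SAN, which matches the provider named in the Upgradeable message.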
Tested oauth-openshift route cert customization and, after confirming with Dev, filed separate bug 2052467 to track it. For the apiserver_webhooks_x509_missing_san_total and apiserver_kube_aggregator_x509_missing_san_total tests, I asked the Apiserver QE guys to help; testing is ongoing.

Recovery tests confirmed that the kube-apiserver fails to recover when webhook server certificates without SAN are replaced by certificates with SAN. The kube-apiserver was set to "Upgradeable: False" because server certificates without SAN were detected in webhooks. To recover from this situation, I replaced the certificate with one that has a SAN and waited for the status to clear from the kube-apiserver. The kube-apiserver fails to clear "Upgradeable: False" even after a longer wait. I also noted that the metric apiserver_webhooks_x509_missing_san_total was not reset to zero in this situation. Please find the steps to reproduce the issue below.
1. Pre-condition: put kube-apiserver into "Upgradeable: False" by referring to certificates without SAN. Follow the steps in https://bugzilla.redhat.com/show_bug.cgi?id=2037274#c13
2. Create certificates with SAN
openssl genrsa -out caKey.pem 2048
openssl req -x509 -new -nodes -key caKey.pem -days 100000 -out caCert.pem -subj "/CN=wb_ca"
openssl genrsa -out serverKey.pem 2048
cat > server.conf << EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
IP.1 = 127.0.0.1
DNS.1 = opa.opa.svc
EOF
openssl req -new -key serverKey.pem -out serverWithSAN.csr -subj "/CN=opa.opa.svc" -config server.conf
openssl x509 -req -in serverWithSAN.csr -CA caCert.pem -CAkey caKey.pem -CAcreateserial -out serverCertWithSAN.pem -days 100000 -extensions v3_req -extfile server.conf
3. Delete existing tls secret
oc delete secrets opa-server -n opa
4. Create new secret with SAN cert
oc create secret tls opa-server --cert=serverCertWithSAN.pem --key=serverKey.pem -n opa
5. Delete webhook and recreate with certificates with SAN
$ oc delete ValidatingWebhookConfiguration opa-validating-webhook
$ cat > webhook-configuration.yaml <<EOF
kind: ValidatingWebhookConfiguration
apiVersion: admissionregistration.k8s.io/v1
metadata:
  name: opa-validating-webhook
webhooks:
  - name: validating-webhook.openpolicyagent.org
    admissionReviewVersions:
      - v1beta1
    sideEffects: None
    namespaceSelector:
      matchExpressions:
      - key: openpolicyagent.org/webhook
        operator: In
        values:
        - ignore
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: ["*"]
        apiVersions: ["*"]
        resources: ["*"]
        scope: 'Namespaced'
    clientConfig:
      caBundle: $(cat serverCertWithSAN.pem | base64 | tr -d '\n')
      service:
        namespace: opa
        name: opa
EOF
$ oc apply -f webhook-configuration.yaml
6. Restart the opa pods to get the new secret changes.
$ oc delete replicaset <replica_name> -n opa
7. Describe the kube-apiserver to see the status.
$ oc describe co/kube-apiserver
Name: kube-apiserver
........
Last Transition Time: 2022-02-11T06:58:15Z
Message: InvalidCertsUpgradeable: Server certificates without SAN detected: {type="webhooks"}. These have to be replaced to include the respective hosts in their SAN extension and not rely on the Subject's CN for the purpose of hostname verification.
Reason: InvalidCerts_InvalidCertsDetected
Status: False
Type: Upgradeable
Extension: <nil>
Check the Prometheus UI for the apiserver_webhooks_x509_missing_san_total metric value.
apiserver_webhooks_x509_missing_san_total{apiserver="kube-apiserver", endpoint="https", instance="10.0.0.6:6443", job="apiserver", namespace="default", service="kubernetes"} 9
apiserver_webhooks_x509_missing_san_total{apiserver="kube-apiserver", endpoint="https", instance="10.0.0.7:6443", job="apiserver", namespace="default", service="kubernetes"} 4
apiserver_webhooks_x509_missing_san_total{apiserver="kube-apiserver", endpoint="https", instance="10.0.0.8:6443", job="apiserver", namespace="default", service="kubernetes"} 33
As you can see above, the kube-apiserver failed to recover from the current status (i.e. "Upgradeable: False" due to webhook certificates without SAN, even after they were replaced by certificates with SAN). Also note that in the Prometheus UI, the apiserver_webhooks_x509_missing_san_total count was not reset to zero.

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.22 True False 13h Cluster version is 4.9.22
See the steps below.
1. Create certificates with no SAN:
$ git clone git:wangke19/sample-apiserver.git # Some changes adapt to current testing on OpenShift
$ cd sample-apiserver
$ mkdir certs
$ cd certs
$ openssl genrsa -out caKey.pem 2048
$ openssl req -x509 -new -nodes -key caKey.pem -days 100000 -out caCert.pem -subj "/CN=system:openshift-aggregator"
$ openssl genrsa -out serverKey.pem 2048
$ cat > server_no_san.conf << EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
EOF
$ openssl req -new -key serverKey.pem -out serverNoSAN.csr -subj "/CN=localhost" -config server_no_san.conf
$ openssl x509 -req -in serverNoSAN.csr -CA caCert.pem -CAkey caKey.pem -CAcreateserial -out serverCertNoSAN.pem -days 100000 -extensions v3_req -extfile server_no_san.conf
$ openssl x509 -noout -text -in serverCertNoSAN.pem
Certificate:
...
Issuer: CN = system:openshift-aggregator
...
Subject: CN = localhost
...
X509v3 extensions:
X509v3 Basic Constraints:
CA:FALSE
X509v3 Key Usage:
Digital Signature, Non Repudiation, Key Encipherment
X509v3 Extended Key Usage:
TLS Web Client Authentication, TLS Web Server Authentication
2. Deploy the sample-apiserver to cluster.
$ oc create secret tls aggserver --cert=serverCertNoSAN.pem --key=serverKey.pem
$ cd sample-apiserver/artifacts/example
Specify the tls cert and key for the aggregator sample apiserver:
$ cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wardle-server
  namespace: default
  labels:
    apiserver: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      apiserver: "true"
  template:
    metadata:
      labels:
        apiserver: "true"
    spec:
      serviceAccountName: apiserver
      containers:
      - name: wardle-server
        # build from staging/src/k8s.io/sample-apiserver/artifacts/simple-image/Dockerfile
        # or
        # docker pull k8s.gcr.io/e2e-test-images/sample-apiserver:1.17.4
        # docker tag k8s.gcr.io/e2e-test-images/sample-apiserver:1.17.4 kube-sample-apiserver:latest
        image: quay.io/wangke19/kube-sample-apiserver
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - readOnly: true
          mountPath: /certs
          name: apiserver
        args:
        - "--etcd-servers=http://localhost:2379"
        - "--tls-cert-file=/certs/tls.crt"
        - "--tls-private-key-file=/certs/tls.key"
      - name: etcd
        image: quay.io/coreos/etcd:v3.5.1
      volumes:
      - name: apiserver
        secret:
          secretName: aggserver
$ cat apiservice.yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.wardle.example.com
spec:
  insecureSkipTLSVerify: true
  group: wardle.example.com
  groupPriorityMinimum: 1000
  versionPriority: 15
  service:
    name: api
    namespace: default
  clientConfig:
    caBundle: $(cat ../../certs/serverCertNoSAN.pem | base64 | tr -d '\n')
  version: v1alpha1
$ oc apply -f artifacts/example
$ oc get apiservice/v1alpha1.wardle.example.com
NAME SERVICE AVAILABLE AGE
v1alpha1.wardle.example.com default/api True 62s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3. Verification for the apiserver_kube_aggregator_x509_missing_san_total metric:
$ oc describe co kube-apiserver | grep -B5 'Extension:'
Last Transition Time: 2022-03-01T02:38:25Z
Message: InvalidCertsUpgradeable: Server certificates without SAN detected: {type="aggregation"}. These have to be replaced to include the respective hosts in their SAN extension and not rely on the Subject's CN for the purpose of hostname verification.
Reason: InvalidCerts_InvalidCertsDetected
Status: False
Type: Upgradeable
Extension: <nil>
$ token=`oc sa get-token cluster-monitoring-operator -n openshift-monitoring`
$ pod_name=$(oc get pods -n openshift-kube-apiserver | grep -i 'running' | head -1 | cut -d " " -f1)
$ op_endp=$(oc get ep -n openshift-kube-apiserver | grep 6443 | awk '{print $2}'| cut -d "," -f1)
$ oc -n openshift-kube-apiserver exec -it $pod_name -- curl -k -H "Authorization: Bearer $token" https://$op_endp/metrics | grep aggregator_x509_missing_san_total
...
# HELP apiserver_kube_aggregator_x509_missing_san_total [ALPHA] Counts the number of requests to servers missing SAN extension in their serving certificate OR the number of connection failures due to the lack of x509 certificate SAN extension missing (either/or, based on the runtime environment)
# TYPE apiserver_kube_aggregator_x509_missing_san_total counter
apiserver_kube_aggregator_x509_missing_san_total 7
From the above, the aggregator sample apiserver's cert missing the SAN extension is detected and Upgradeable is False, which prevents the user from upgrading to 4.10, as expected.
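The `grep -B5 'Extension:'` trick used above depends on Upgradeable being the last condition listed. A minimal sketch of a more direct extraction, assuming the describe layout shown in this bug (Reason and Status printed before each Type line); the function name `upgradeable_condition` is hypothetical:

```shell
#!/bin/sh
# upgradeable_condition (hypothetical helper): read `oc describe co <name>`
# output on stdin and print "<Status> <Reason>" of the Upgradeable condition.
# Exits non-zero when no Upgradeable condition is found.
upgradeable_condition() {
    awk '
        # Each condition prints Reason and Status before its Type line,
        # so remember them until the matching Type shows up.
        $1 == "Reason:" { reason = $2 }
        $1 == "Status:" { status = $2 }
        $1 == "Type:" {
            if ($2 == "Upgradeable") { print status, reason; found = 1 }
            status = ""; reason = ""
        }
        END { exit(found ? 0 : 1) }
    '
}
```

It would be used as `oc describe co kube-apiserver | upgradeable_condition`. Querying the field directly with `oc get co kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].status}'` should give the status alone without any text parsing.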
For the apiserver_kube_aggregator_x509_missing_san_total test, I retested comment 15 with the latest 4.9 payload and got the same results, as expected. Next, verifying the cert with SAN for the aggregator apiserver:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.0-0.nightly-2022-03-15-215745 True False 7h2m Cluster version is 4.9.0-0.nightly-2022-03-15-215745
- Update the cert to use SAN and make the component pod reload the new cert
$ cd sample-apiserver
$ mkdir san_certs
$ cd san_certs
$ openssl genrsa -out caKey.pem 2048
$ openssl req -x509 -new -nodes -key caKey.pem -days 100000 -out caCert.pem -subj "/CN=system:openshift-aggregator"
$ openssl genrsa -out serverKey.pem 2048
$ cat > server.conf << EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
IP.1 = 127.0.0.1
DNS.1 = localhost
EOF
$ openssl req -new -key serverKey.pem -out serverSAN.csr -subj "/CN=localhost" -config server.conf
$ openssl x509 -req -in serverSAN.csr -CA caCert.pem -CAkey caKey.pem -CAcreateserial -out serverCertSAN.pem -days 100000 -extensions v3_req -extfile server.conf
$ openssl x509 -noout -text -in serverCertSAN.pem
Certificate:
...
Issuer: CN = system:openshift-aggregator
...
Subject: CN = localhost
...
X509v3 Subject Alternative Name:
IP Address:127.0.0.1, DNS:localhost
$ oc create secret tls aggserver --cert=serverCertSAN.pem --key=serverKey.pem
secret/aggserver created
Edit apiservice.yaml, change caBundle to the cert with SAN, and save.
$ cd artifacts/example
$ vi apiservice.yaml
...
clientConfig:
caBundle: $(cat ../../certs/serverCertSAN.pem | base64 | tr -d '\n')
$ cd ../../
$ oc apply -f artifacts/example
apiservice.apiregistration.k8s.io/v1alpha1.wardle.example.com created
clusterrolebinding.rbac.authorization.k8s.io/wardle:system:auth-delegator created
rolebinding.rbac.authorization.k8s.io/wardle-auth-reader created
deployment.apps/wardle-server created
clusterrolebinding.rbac.authorization.k8s.io/sample-apiserver-clusterrolebinding created
clusterrole.rbac.authorization.k8s.io/aggregated-apiserver-clusterrole created
serviceaccount/apiserver created
service/api created
$ oc get apiservice/v1alpha1.wardle.example.com
NAME SERVICE AVAILABLE AGE
v1alpha1.wardle.example.com default/api True 48s
- Describe the kube-apiserver to see the status:
$ oc describe co kube-apiserver | grep -B5 'Extension:'
Last Transition Time: 2022-03-16T07:56:29Z
Message: InvalidCertsUpgradeable: Server certificates without SAN detected: {type="aggregation"}. These have to be replaced to include the respective hosts in their SAN extension and not rely on the Subject's CN for the purpose of hostname verification.
Reason: InvalidCerts_InvalidCertsDetected
Status: False
Type: Upgradeable
Extension: <nil>
After a few minutes, Upgradeable should change from False to True:
$ oc describe co kube-apiserver | grep -B5 'Extension:'
Last Transition Time: 2022-03-16T09:24:29Z
Message: KubeletMinorVersionUpgradeable: Kubelet and API server minor versions are synced.
Reason: AsExpected
Status: True
Type: Upgradeable
Extension: <nil>
- Check the Prometheus UI for the apiserver_kube_aggregator_x509_missing_san_total metric value, wait more than 5m, and check it again; the count stays the same, without any changes.
apiserver_kube_aggregator_x509_missing_san_total kube-apiserver https 10.0.155.45:6443 apiserver default openshift-monitoring/k8s kubernetes 72
apiserver_kube_aggregator_x509_missing_san_total kube-apiserver https 10.0.171.167:6443 apiserver default openshift-monitoring/k8s kubernetes 111
apiserver_kube_aggregator_x509_missing_san_total kube-apiserver https 10.0.194.169:6443 apiserver default openshift-monitoring/k8s kubernetes 533
Based on the above, the kube-apiserver verified the aggregator apiserver using a cert with or without SAN, as expected.

Verified openshift_auth_x509_missing_san_total in 4.9.0-0.nightly-2022-03-15-215745. To keep this bug from becoming too crowded and lengthy, the verification steps are in the doc https://docs.google.com/document/d/1E0s2jp_zpeJGPOZ-iSBe1sYFvK1C659Ut-X0y7JpQ6c/edit#heading=h.aipr2g1orltn, in the "Verification for openshift_auth_x509_missing_san_total" section.
Conclusion: when a SAN-less cert is used in the IDP server, the query
sum by (provider) (increase(openshift_auth_x509_missing_san_total{job="oauth-openshift",namespace="openshift-authentication"}[30m])) or on() vector(0)
is positive and sets Upgradeable to False; when the IDP server is updated with a SAN cert, after waiting 30 minutes the query drops to zero and Upgradeable turns True again.
Note: the fix has a flaw. Even if we do nothing but wait 30 minutes, the operator's Upgradeable can still turn True. This is likely a flaw that cannot be fixed in theory.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.25 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2022:0861

*** Bug 2054256 has been marked as a duplicate of this bug. ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
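The comments above report raw counters that never return to zero, while the auth verification conclusion describes a condition driven by `increase(...[30m])`. A simplified sketch of why both can hold, assuming integer samples taken at the start and end of a window; the `counter_increase` helper is hypothetical, and the real Prometheus `increase()` also extrapolates over the window, which this ignores:

```shell
#!/bin/sh
# counter_increase (hypothetical helper): growth of a monotonic counter
# between two samples, given as $1 (older) and $2 (newer).
# A newer value below the older one is treated as a counter reset
# (for example after a pod restart), so the newer value itself is the growth.
counter_increase() {
    old=$1
    new=$2
    if [ "$new" -lt "$old" ]; then
        echo "$new"
    else
        echo $((new - old))
    fi
}
```

With the first sample from the last Prometheus check above, `counter_increase 72 72` yields 0 even though the raw counter still reads 72. This is consistent with Upgradeable returning to True once no new SAN-less connections are counted within the window, including the case where nothing was fixed and no connection was attempted.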