Bug 1530312 - Redeploy openshift CA certificate fails via ansible installer
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Andrew Butcher
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-01-02 14:30 UTC by Neeraj
Modified: 2018-01-17 19:18 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-17 19:18:59 UTC
Target Upstream Version:
Embargoed:



Description Neeraj 2018-01-02 14:30:35 UTC
Description of problem:
Redeploying the OpenShift CA certificate with the Ansible installer failed.

Version-Release number of selected component (if applicable):

OCP 3.6

How reproducible:

Actual results:

The playbook failed, and the master API, master controllers, and node services all failed.


pzj openshift[97809]: Failed to dial ocp-master-w4ph:2379: connection error: desc = "transport: context canceled"; please retry.
Jan 02 08:39:33 ocp-master-3pzj openshift[97809]: Failed to dial ocp-master-3pzj:2379: grpc: the connection is closing; please retry.
Jan 02 08:39:33 ocp-master-3pzj openshift[97809]: Failed to dial ocp-master-3pzj:2379: connection error: desc = "transport: context canceled"; please retry.
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.135849   97809 reflector.go:187] Starting reflector *user.Group (2m0s) from github.com/openshift/origin/pkg/user/cache/groups.go:54
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.135907   97809 run_components.go:91] Using default project node label selector: role=worker
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136199   97809 reflector.go:198] Starting reflector *authorization.ClusterPolicyBinding (10m0s) from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136267   97809 reflector.go:236] Listing and watching *authorization.ClusterPolicyBinding from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136467   97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/authorization.openshift.io/v1/clusterpolicybindings?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136482   97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136495   97809 round_trippers.go:393]     User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136505   97809 round_trippers.go:393]     Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136956   97809 reflector.go:198] Starting reflector *quota.ClusterResourceQuota (10m0s) from github.com/openshift/origin/pkg/quota/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136998   97809 reflector.go:236] Listing and watching *quota.ClusterResourceQuota from github.com/openshift/origin/pkg/quota/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137289   97809 clusterquotamapping.go:160] Starting ClusterQuotaMappingController controller
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137340   97809 reflector.go:236] Listing and watching *user.Group from github.com/openshift/origin/pkg/user/cache/groups.go:54
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137634   97809 reflector.go:198] Starting reflector *authorization.ClusterPolicy (10m0s) from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137685   97809 reflector.go:236] Listing and watching *authorization.ClusterPolicy from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137800   97809 reflector.go:198] Starting reflector *authorization.Policy (10m0s) from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137808   97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/authorization.openshift.io/v1/clusterpolicies?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137843   97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137844   97809 reflector.go:236] Listing and watching *authorization.Policy from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137854   97809 round_trippers.go:393]     User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137865   97809 round_trippers.go:393]     Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137978   97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/authorization.openshift.io/v1/policies?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj openshift[97809]: Failed to dial ocp-master-w4ph:2379: connection error: desc = "transport: context canceled"; please retry.
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137995   97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138006   97809 round_trippers.go:393]     Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138016   97809 round_trippers.go:393]     User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138320   97809 reflector.go:198] Starting reflector *authorization.PolicyBinding (10m0s) from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138353   97809 reflector.go:236] Listing and watching *authorization.PolicyBinding from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138465   97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/quota.openshift.io/v1/clusterresourcequotas?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138480   97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138489   97809 round_trippers.go:393]     Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138498   97809 round_trippers.go:393]     User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138604   97809 round_trippers.go:408] Response Status:  in 2 milliseconds
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138616   97809 round_trippers.go:411] Response Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138732   97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/authorization.openshift.io/v1/policybindings?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138749   97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138759   97809 round_trippers.go:393]     Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138768   97809 round_trippers.go:393]     User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.139135   97809 round_trippers.go:408] Response Status:  in 1 milliseconds
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.139149   97809 round_trippers.go:411] Response Headers:
Jan 02 08:39:33 ocp-master-3pzj openshift[97809]: Failed to dial ocp-master-h8sg:2379: connection error: desc = "transport: context canceled"; please retry.
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.139732   97809 round_trippers.go:408] Response Status:  in 1 milliseconds
Jan 02 08:39:34 ocp-master-3pzj systemd[1]: Started Atomic OpenShift Master API.
-- Subject: Unit atomic-openshift-master-api.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit atomic-openshift-master-api.service has finished starting up.
-- 
-- The start-up result is done.
Jan 02 08:39:34 ocp-master-3pzj systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=255/n/a
Jan 02 08:39:34 ocp-master-3pzj systemd[1]: Unit atomic-openshift-master-api.service entered failed state.
Jan 02 08:39:34 ocp-master-3pzj systemd[1]: atomic-openshift-master-api.service failed.
Jan 02 08:39:39 ocp-master-3pzj systemd[1]: atomic-openshift-master-api.service holdoff time over, scheduling restart.
Jan 02 08:39:39 ocp-master-3pzj systemd[1]: Starting Atomic OpenShift Master API...
-- Subject: Unit atomic-openshift-master-api.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit atomic-openshift-master-api.service has begun starting up.


--------------------------------------Controllers-------------------------

kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:58:30 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:58:30.239309   93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:58:45 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:58:45.239454   93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:59:00 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:59:00.239421   93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:59:15 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:59:15.239571   93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:59:30 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:59:30.239565   93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused




Expected results:

Runs successfully.

Comment 3 Scott Dodson 2018-01-02 14:45:50 UTC
The error from etcd client is pretty generic and I've seen this happen when etcd runs out of quota space. Can you verify that there are no alarms tripped?

On one of their etcd hosts, run:

etcdctl3 alarm list

If it shows an alarm, they need to increase their quota size. Add this to /etc/etcd/etcd.conf on each host and restart etcd:

ETCD_QUOTA_BACKEND_BYTES=4294967296

Then clear the alarms:

etcdctl3 alarm disarm

If no alarms are shown then this isn't the problem. I'm still reviewing the attached logs for other problems.
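
The alarm check above could be scripted roughly as follows. This is only a sketch: the function name has_nospace_alarm is hypothetical, and it assumes that a tripped quota alarm shows up in the `etcdctl3 alarm list` output as a line containing alarm:NOSPACE.

```shell
# Hypothetical helper: reads `etcdctl3 alarm list` output on stdin and
# succeeds (exit 0) if a NOSPACE alarm is reported.
# Assumption: a tripped quota alarm is printed as "memberID:<id> alarm:NOSPACE".
has_nospace_alarm() {
    grep -q 'alarm:NOSPACE'
}

# Example use on an etcd host:
#   if etcdctl3 alarm list | has_nospace_alarm; then
#       echo "quota alarm tripped: raise ETCD_QUOTA_BACKEND_BYTES, restart etcd, then disarm"
#   fi
```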

Comment 5 Scott Dodson 2018-01-02 21:29:42 UTC
After reviewing some of the attached sosreports, we believe that the CA originally used to sign internal components has been removed and replaced, either manually or errantly by the playbooks. Does the customer still have access to the CA certificate that was used to sign /etc/origin/master/master.server.crt and other certs such as /etc/origin/node/system:node:ocp-master-3pzj.crt? If so, we'd advise them to append that certificate to /etc/origin/master/ca-bundle.crt and /etc/origin/node/ca.crt on all hosts and restart services.

Here we examine the issuer for these certs

$ openssl x509 -text -in /etc/origin/node/system\:node\:ocp-master-3pzj.crt | grep Issuer
        Issuer: C=IN, L=Bangalore, O=Wipro Ltd, OU=WBPO, CN=www.deltaverge.com

$ openssl x509 -text -in /etc/origin/master/master.server.crt | grep Issuer
        Issuer: C=IN, L=Bangalore, O=Wipro Ltd, OU=WBPO, CN=www.deltaverge.com

The old CA may have been preserved in /etc/origin/master/legacy-ca/, where we keep a copy of all CA certificates before replacing them. If that were the case, however, we'd expect it to have been appended to the CA bundle in /etc/origin/master/ca-bundle.crt.

You can verify that the CA matches the certificate like this. In the output below, verification fails with an error:

$ openssl verify -CAfile /etc/origin/master/ca-bundle.crt /etc/origin/master/master.server.crt 
master.server.crt: CN = 10.140.0.19
error 20 at 0 depth lookup:unable to get local issuer certificate

Here's what successful verification looks like. Your etcd certs look fine.

$ openssl verify -CAfile /etc/etcd/ca.crt /etc/etcd/server.crt 
server.crt: OK

Once you find the cert that verifies /etc/origin/master/master.server.crt, append it to /etc/origin/master/ca-bundle.crt and /etc/origin/node/ca.crt and restart services.
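
One way to search for the matching CA is to run the same openssl verify check against each candidate file in turn. The sketch below is an assumption-laden illustration, not part of openshift-ansible: find_signing_ca is a hypothetical name, and it looks for the ": OK" line that openssl verify prints on success (as in the etcd example above) rather than relying on the exit status.

```shell
# Hypothetical helper: print every candidate CA file that verifies the
# given certificate. Usage: find_signing_ca <cert> <candidate-ca>...
find_signing_ca() {
    cert=$1
    shift
    for ca in "$@"; do
        # A successful verification prints "<cert>: OK" on stdout.
        if openssl verify -CAfile "$ca" "$cert" 2>/dev/null | grep -q ': OK$'; then
            printf '%s\n' "$ca"
        fi
    done
}

# Example (the *.crt glob under legacy-ca is an assumption about file naming):
#   find_signing_ca /etc/origin/master/master.server.crt /etc/origin/master/legacy-ca/*.crt
```

If nothing is printed, none of the candidates signed the certificate and the original CA has to be recovered from elsewhere.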

Comment 8 Andrew Butcher 2018-01-03 22:38:51 UTC
The CA certificate found in the sosreport does not include certificate signing in its key usage, which means that it cannot be used to sign certificates. Examining the key usage of the existing CA certificate from the sosreport with openssl, we see that "Certificate Sign" is not present and that basic constraints indicate CA:FALSE, which means that this is not a CA certificate.

$ openssl x509 -in ./sosreport/etc/origin/master/ca.crt -noout -text
...
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Basic Constraints:
                CA:FALSE
...

Compare that to an openshift-ansible generated OpenShift CA certificate:

$ openssl x509 -in /etc/origin/master/ca.crt -noout -text
...
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment, Certificate Sign
            X509v3 Basic Constraints: critical
                CA:TRUE
...
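
The comparison above can be turned into a quick pre-flight check. The following is a minimal sketch under stated assumptions: is_signing_ca is a hypothetical helper name, and it only inspects the text that `openssl x509 -noout -text` prints. It is deliberately conservative, so a certificate that omits the Key Usage extension altogether is reported as not usable even though such a certificate might still be able to sign.

```shell
# Hypothetical helper: reads `openssl x509 -noout -text` output on stdin and
# succeeds only if the certificate is marked CA:TRUE and its key usage
# includes "Certificate Sign".
is_signing_ca() {
    text=$(cat)
    printf '%s\n' "$text" | grep -q 'CA:TRUE' &&
        printf '%s\n' "$text" | grep -q 'Certificate Sign'
}

# Example use:
#   if openssl x509 -in /etc/origin/master/ca.crt -noout -text | is_signing_ca; then
#       echo "usable as a signing CA"
#   else
#       echo "not a signing CA"
#   fi
```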

Additionally, the current CA certificate is issued by an intermediate Digicert CA which means that our bundle must contain the intermediate and root certificates for the CA we use in order to validate child certificates. openshift-ansible does not currently include a mechanism for specifying additional CA certificates to include in the bundle so using a custom CA certificate with an extended chain may have to be accomplished manually in part. We are working to verify steps for using a custom CA certificate with an extended chain with openshift-ansible.

From this point we can either (a) use a different CA certificate that has been verified to support signing to recreate all cluster certificates, while also ensuring that the intermediate and root certificates for that CA are present in the CA bundle, or (b) generate a new CA certificate using openshift-ansible and create internal certificates using the newly generated CA.

To generate a new OpenShift CA with openshift-ansible, we can ensure that openshift_master_ca_certificate is unset in the inventory and then run the redeploy-openshift-ca.yml playbook (/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/redeploy-openshift-ca.yml). The existing cluster certificates will not allow services to restart during the redeploy-openshift-ca.yml playbook, so we should skip the service restart steps by commenting out the service restart tasks. The two service restart blocks linked below must be commented out entirely in the on-disk playbook at /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/redeploy-certificates/openshift-ca.yml.

Master service restart: https://github.com/openshift/openshift-ansible/blob/release-3.6/playbooks/common/openshift-cluster/redeploy-certificates/openshift-ca.yml#L211-L227

Node service restart: https://github.com/openshift/openshift-ansible/blob/release-3.6/playbooks/common/openshift-cluster/redeploy-certificates/openshift-ca.yml#L276-L301

Once the CA has been regenerated and distributed, we can generate new cluster certificates by running the redeploy-certificates.yml playbook (/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/redeploy-certificates.yml). Services should now be able to restart, as this playbook restarts services after all certificates have been replaced.
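
For reference, the order of the two playbook runs can be written down as below. This is a sketch only: print_redeploy_steps is a hypothetical helper that prints the commands for review instead of running them, and the inventory path passed to it is an assumption to be replaced with the real one.

```shell
# Hypothetical helper: print, in order, the two playbook runs described
# above so they can be reviewed before being executed by hand.
# The inventory path argument is an assumption; substitute your own.
print_redeploy_steps() {
    inventory=$1
    base=/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster
    # Step 1: regenerate and distribute the OpenShift CA
    # (with the service restart tasks commented out beforehand).
    printf 'ansible-playbook -i %s %s/redeploy-openshift-ca.yml\n' "$inventory" "$base"
    # Step 2: reissue all cluster certificates; this run restarts services.
    printf 'ansible-playbook -i %s %s/redeploy-certificates.yml\n' "$inventory" "$base"
}

# Example use:
#   print_redeploy_steps /etc/ansible/hosts
```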



Also note that the redeploy-certificates.yml playbook creates tar archives of /etc/origin/master within /etc/origin for each master before creating new certificates. The first master host will also archive generated certificates (/etc/origin/generated-configs) which is where certificates are created before being distributed to masters other than the first master as well as all nodes.

These archives are named /etc/origin/master-node-cert-config-backup-{{ ansible_date_time.epoch }}.tgz, and the oldest archive should contain the original CA certificate as well as all of the original certificates for the cluster, in folders named after the hosts. Note that these archives will only exist if the redeploy-certificates.yml playbook has been run. The redeploy-openshift-ca.yml playbook simply copies all previous CA artifacts to the /etc/origin/master/legacy-ca directory so that all previous CA certificates may be included in the CA bundle.
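
Since the archive name ends in an epoch timestamp, the oldest archive can be located by sorting on that suffix. A minimal sketch, assuming the naming pattern above; oldest_cert_backup is a hypothetical helper name.

```shell
# Hypothetical helper: print the backup archive with the smallest epoch
# suffix in the given directory (default: /etc/origin).
oldest_cert_backup() {
    dir=${1:-/etc/origin}
    for f in "$dir"/master-node-cert-config-backup-*.tgz; do
        [ -e "$f" ] || continue      # the glob matched nothing
        epoch=${f##*-backup-}        # drop everything up to the epoch
        epoch=${epoch%.tgz}          # drop the extension
        printf '%s %s\n' "$epoch" "$f"
    done | sort -n | head -n 1 | cut -d' ' -f2-
}

# Example use: list the contents of the oldest archive.
#   tar -tzf "$(oldest_cert_backup)"
```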

Comment 10 Andrew Butcher 2018-01-04 20:42:06 UTC
(In reply to Andrew Butcher from comment #8)
> Additionally, the current CA certificate is issued by an intermediate
> Digicert CA which means that our bundle must contain the intermediate and
> root certificates for the CA we use in order to validate child certificates.
> openshift-ansible does not currently include a mechanism for specifying
> additional CA certificates to include in the bundle so using a custom CA
> certificate with an extended chain may have to be accomplished manually in
> part. We are working to verify steps for using a custom CA certificate with
> an extended chain with openshift-ansible.

In order to use an intermediate CA certificate with openshift-ansible, the CA certificate supplied as the openshift_master_ca_certificate must contain the full chain.

To test this, I created an intermediate CA using Jamie Nguyen's guide [1] but created the keys without passphrases. I combined the intermediate CA certificate and the root CA certificate into a single file beginning with the intermediate CA certificate and used the full chain as my openshift_master_ca_certificate.

For example:

$ cat intermediate/certs/intermediate.cert.pem \
      certs/ca.cert.pem > intermediate/certs/ca-chain.cert.pem


openshift_master_ca_certificate={'certfile': '/home/abutcher/ca/intermediate/certs/ca-chain.cert.pem', 'keyfile': '/home/abutcher/ca/intermediate/private/intermediate.key.pem'}
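
A quick sanity check on the combined file is to count the certificates it contains; for the two-level chain built above the expected count is 2. This is a sketch with a hypothetical helper name (count_pem_certs), and it only counts PEM headers without validating the certificates or their order.

```shell
# Hypothetical helper: count the PEM certificates in a chain or bundle file.
count_pem_certs() {
    grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}

# Example use:
#   count_pem_certs intermediate/certs/ca-chain.cert.pem
```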


Before running openshift-ansible the CA can be tested with oc by trying to create a certificate. Running the first command here will create a test certificate and key in /tmp/. The second command verifies the testing certificate using the existing CA bundle.

$ oc adm ca create-server-cert \
     --signer-cert=/root/ca-chain.cert.pem \
     --signer-key=/root/intermediate.key.pem \
     --signer-serial=/root/ca.serial.txt \
     --hostnames="testing.example.com" \
     --cert=/tmp/testing.crt \
     --key=/tmp/testing.key

$ openssl verify -CAfile /root/ca-chain.cert.pem /tmp/testing.crt 
/tmp/testing.crt: OK

My resultant cluster certificates are signed by my intermediate CA certificate and can be verified with the CA bundle.

$ openssl x509 -in /etc/origin/master/ca.crt -noout -text
Certificate:
...
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=US, ST=North Carolina, O=Flibjib, CN=Flibjib Root
        Validity
            Not Before: Jan  4 20:02:48 2018 GMT
            Not After : Jan  2 20:02:48 2028 GMT
        Subject: C=US, ST=North Carolina, O=Flibjib, CN=Flibjib Intermediate
...


$ openssl x509 -in /etc/origin/master/master.server.crt -noout -text
Certificate:
...
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=US, ST=North Carolina, O=Flibjib, CN=Flibjib Intermediate
        Validity
            Not Before: Jan  4 20:17:52 2018 GMT
            Not After : Jan  4 20:17:53 2020 GMT
...


$ openssl verify -CAfile /etc/origin/master/ca-bundle.crt /etc/origin/master/master.server.crt 
/etc/origin/master/master.server.crt: OK


[1] https://jamielinux.com/docs/openssl-certificate-authority/index.html

Comment 12 Scott Dodson 2018-01-17 19:18:59 UTC
This was the result of using an invalid certificate authority to re-sign the cluster. Closing notabug.

