Description of problem:
An attempt to redeploy the OpenShift CA certificate failed.

Version-Release number of selected component (if applicable):
OCP 3.6

How reproducible:

Actual results:
The playbook failed, and the master API, master controllers, and node services all failed.

pzj openshift[97809]: Failed to dial ocp-master-w4ph:2379: connection error: desc = "transport: context canceled"; please retry.
Jan 02 08:39:33 ocp-master-3pzj openshift[97809]: Failed to dial ocp-master-3pzj:2379: grpc: the connection is closing; please retry.
Jan 02 08:39:33 ocp-master-3pzj openshift[97809]: Failed to dial ocp-master-3pzj:2379: connection error: desc = "transport: context canceled"; please retry.
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.135849 97809 reflector.go:187] Starting reflector *user.Group (2m0s) from github.com/openshift/origin/pkg/user/cache/groups.go:54
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.135907 97809 run_components.go:91] Using default project node label selector: role=worker
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136199 97809 reflector.go:198] Starting reflector *authorization.ClusterPolicyBinding (10m0s) from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136267 97809 reflector.go:236] Listing and watching *authorization.ClusterPolicyBinding from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136467 97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/authorization.openshift.io/v1/clusterpolicybindings?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136482 97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136495 97809 round_trippers.go:393] User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136505 97809 round_trippers.go:393] Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136956 97809 reflector.go:198] Starting reflector *quota.ClusterResourceQuota (10m0s) from github.com/openshift/origin/pkg/quota/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.136998 97809 reflector.go:236] Listing and watching *quota.ClusterResourceQuota from github.com/openshift/origin/pkg/quota/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137289 97809 clusterquotamapping.go:160] Starting ClusterQuotaMappingController controller
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137340 97809 reflector.go:236] Listing and watching *user.Group from github.com/openshift/origin/pkg/user/cache/groups.go:54
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137634 97809 reflector.go:198] Starting reflector *authorization.ClusterPolicy (10m0s) from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137685 97809 reflector.go:236] Listing and watching *authorization.ClusterPolicy from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137800 97809 reflector.go:198] Starting reflector *authorization.Policy (10m0s) from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137808 97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/authorization.openshift.io/v1/clusterpolicies?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137843 97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137844 97809 reflector.go:236] Listing and watching *authorization.Policy from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137854 97809 round_trippers.go:393] User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137865 97809 round_trippers.go:393] Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137978 97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/authorization.openshift.io/v1/policies?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj openshift[97809]: Failed to dial ocp-master-w4ph:2379: connection error: desc = "transport: context canceled"; please retry.
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.137995 97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138006 97809 round_trippers.go:393] Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138016 97809 round_trippers.go:393] User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138320 97809 reflector.go:198] Starting reflector *authorization.PolicyBinding (10m0s) from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138353 97809 reflector.go:236] Listing and watching *authorization.PolicyBinding from github.com/openshift/origin/pkg/authorization/generated/informers/internalversion/factory.go:45
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138465 97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/quota.openshift.io/v1/clusterresourcequotas?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138480 97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138489 97809 round_trippers.go:393] Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138498 97809 round_trippers.go:393] User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138604 97809 round_trippers.go:408] Response Status: in 2 milliseconds
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138616 97809 round_trippers.go:411] Response Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138732 97809 round_trippers.go:383] GET https://ocp-master-3pzj/apis/authorization.openshift.io/v1/policybindings?resourceVersion=0
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138749 97809 round_trippers.go:390] Request Headers:
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138759 97809 round_trippers.go:393] Accept: application/vnd.kubernetes.protobuf,application/json
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.138768 97809 round_trippers.go:393] User-Agent: openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.139135 97809 round_trippers.go:408] Response Status: in 1 milliseconds
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.139149 97809 round_trippers.go:411] Response Headers:
Jan 02 08:39:33 ocp-master-3pzj openshift[97809]: Failed to dial ocp-master-h8sg:2379: connection error: desc = "transport: context canceled"; please retry.
Jan 02 08:39:33 ocp-master-3pzj atomic-openshift-master-api[97809]: I0102 08:39:33.139732 97809 round_trippers.go:408] Response Status: in 1 milliseconds
Jan 02 08:39:34 ocp-master-3pzj systemd[1]: Started Atomic OpenShift Master API.
-- Subject: Unit atomic-openshift-master-api.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit atomic-openshift-master-api.service has finished starting up.
--
-- The start-up result is done.
Jan 02 08:39:34 ocp-master-3pzj systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=255/n/a
Jan 02 08:39:34 ocp-master-3pzj systemd[1]: Unit atomic-openshift-master-api.service entered failed state.
Jan 02 08:39:34 ocp-master-3pzj systemd[1]: atomic-openshift-master-api.service failed.
Jan 02 08:39:39 ocp-master-3pzj systemd[1]: atomic-openshift-master-api.service holdoff time over, scheduling restart.
Jan 02 08:39:39 ocp-master-3pzj systemd[1]: Starting Atomic OpenShift Master API...
-- Subject: Unit atomic-openshift-master-api.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit atomic-openshift-master-api.service has begun starting up.

--------------------------------------Controllers-------------------------
kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:58:30 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:58:30.239309 93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:58:45 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:58:45.239454 93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:59:00 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:59:00.239421 93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:59:15 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:59:15.239571 93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused
Jan 02 08:59:30 ocp-master-3pzj atomic-openshift-master-controllers[93740]: E0102 08:59:30.239565 93740 leaderelection.go:124] unable to confirm openshift-controller-manager lease exists: Get https://ocp-master-3pzj/api/v1/namespaces/kube-system/endpoints/openshift-controller-manager: dial tcp 10.140.0.19:443: getsockopt: connection refused

Expected results:
Runs successfully.
The error from the etcd client is pretty generic, and I've seen it happen when etcd runs out of quota space. Can you verify that no alarms have been tripped? On one of the etcd hosts, run:

    etcdctl3 alarm list

If it shows an alarm, the quota size needs to be increased. Add this to /etc/etcd/etcd.conf on each host and restart etcd:

    ETCD_QUOTA_BACKEND_BYTES=4294967296

Then clear the alarms:

    etcdctl3 alarm disarm

If no alarms are shown then this isn't the problem. I'm still reviewing the attached logs for other problems.
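As a rough illustration of the check above, this sketch shows where the 4294967296 figure comes from (4 GiB expressed in bytes) and adds a small helper that scans `alarm list` output for a NOSPACE alarm. The helper name `has_nospace_alarm` is hypothetical, and the match assumes the output contains `alarm:NOSPACE`; adjust it if your etcd version prints something different.

```shell
#!/bin/sh
# 4 GiB in bytes, the value suggested for ETCD_QUOTA_BACKEND_BYTES.
quota_bytes=$((4 * 1024 * 1024 * 1024))
echo "ETCD_QUOTA_BACKEND_BYTES=${quota_bytes}"

# Hypothetical helper: reads `etcdctl3 alarm list` output on stdin and
# succeeds (exit 0) only if a NOSPACE alarm is reported.
# Assumption: the alarm line contains the text "alarm:NOSPACE".
has_nospace_alarm() {
    grep -q 'alarm:NOSPACE'
}

# Intended use on an etcd host (not executed here):
#   etcdctl3 alarm list | has_nospace_alarm && echo "etcd quota exceeded"
```

An empty `alarm list` output makes the helper fail, which corresponds to the "this isn't the problem" case.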
After reviewing some of the attached sosreports, we believe that the CA originally used to sign internal components has been removed and replaced, either manually or errantly by the playbooks. Does the customer still have access to the CA certificate that was used to sign /etc/origin/master/master.server.crt and other certs like /etc/origin/node/system:node:ocp-master-3pzj.crt? If so, we'd advise them to append that certificate to /etc/origin/master/ca-bundle.crt and /etc/origin/node/ca.crt on all hosts and restart services.

Here we examine the issuer for these certs:

$ openssl x509 -text -in /etc/origin/node/system\:node\:ocp-master-3pzj.crt | grep Issuer
Issuer: C=IN, L=Bangalore, O=Wipro Ltd, OU=WBPO, CN=www.deltaverge.com
$ openssl x509 -text -in /etc/origin/master/master.server.crt | grep Issue
Issuer: C=IN, L=Bangalore, O=Wipro Ltd, OU=WBPO, CN=www.deltaverge.com

The old CA may have been preserved in /etc/origin/master/legacy-ca/, where we keep a copy of all CA certificates before replacing them. However, if that were the case we'd expect it to have been appended to the CA bundle in /etc/origin/master/ca-bundle.crt.

You can verify that the CA matches the certificate like this. Below you see an error:

$ openssl verify -CAfile /etc/origin/master/ca-bundle.crt /etc/origin/master/master.server.crt
master.server.crt: CN = 10.140.0.19
error 20 at 0 depth lookup:unable to get local issuer certificate

Here's what successful verification looks like; your etcd certs look fine.

$ openssl verify -CAfile /etc/etcd/ca.crt /etc/etcd/server.crt
server.crt: OK

Once you find the cert that verifies /etc/origin/master/master.server.crt, append it to /etc/origin/master/ca-bundle.crt and /etc/origin/node/ca.crt and restart services.
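If there are several candidate CA files to try, the search for "the cert that verifies master.server.crt" could be scripted along these lines. This is only a sketch: the function name `find_signing_ca` is hypothetical, and the `*.crt` glob under legacy-ca is an assumption about how the preserved files are named.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate CA file that verifies CERT.
# Usage: find_signing_ca CERT CANDIDATE...
find_signing_ca() {
    cert=$1
    shift
    for ca in "$@"; do
        [ -f "$ca" ] || continue
        # Match the ": OK" line instead of relying on the exit status,
        # which older openssl releases did not always set on failure.
        if openssl verify -CAfile "$ca" "$cert" 2>/dev/null | grep -q ': OK$'; then
            echo "$ca"
            return 0
        fi
    done
    return 1
}

# Intended use on a master (not executed here; the glob is an assumption):
#   find_signing_ca /etc/origin/master/master.server.crt \
#       /etc/origin/master/legacy-ca/*.crt
```

The function prints nothing and returns non-zero when no candidate verifies the certificate, which is the situation shown by the error 20 output above.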
The CA certificate found in the sosreport does not include certificate signing in its key usage, which means that it cannot be used to sign certificates. Examining the key usage of the existing CA certificate from the sosreport with openssl, we see that "Certificate Sign" is not present and that basic constraints indicate CA:FALSE, which means that this is not a CA certificate.

$ openssl x509 -in ./sosreport/etc/origin/master/ca.crt -noout -text
...
X509v3 Key Usage: critical
    Digital Signature, Key Encipherment
X509v3 Basic Constraints:
    CA:FALSE
...

Compare that to an openshift-ansible generated OpenShift CA certificate:

$ openssl x509 -in /etc/origin/master/ca.crt -noout -text
...
X509v3 Key Usage: critical
    Digital Signature, Key Encipherment, Certificate Sign
X509v3 Basic Constraints: critical
    CA:TRUE
...

Additionally, the current CA certificate is issued by an intermediate Digicert CA, which means that our bundle must contain the intermediate and root certificates for the CA we use in order to validate child certificates. openshift-ansible does not currently include a mechanism for specifying additional CA certificates to include in the bundle, so using a custom CA certificate with an extended chain may have to be accomplished manually in part. We are working to verify steps for using a custom CA certificate with an extended chain with openshift-ansible.

From this point we can either (a) use a different CA certificate that has been verified to support signing to recreate all cluster certificates, while also ensuring that the intermediate and root certificates for that CA are present in the CA bundle, or (b) generate a new CA certificate using openshift-ansible and create internal certificates using the newly generated CA.
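The key usage and basic constraints check described earlier in this comment can be expressed as a small filter over `openssl x509 -noout -text` output. This is a minimal sketch, not part of openshift-ansible; the function name `is_signing_ca` is hypothetical and it only looks for the two strings shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: reads `openssl x509 -noout -text` output on stdin and
# succeeds only if the text contains both CA:TRUE and "Certificate Sign".
is_signing_ca() {
    text=$(cat)
    printf '%s\n' "$text" | grep -q 'CA:TRUE' &&
        printf '%s\n' "$text" | grep -q 'Certificate Sign'
}

# Intended use (not executed here):
#   if openssl x509 -in /etc/origin/master/ca.crt -noout -text | is_signing_ca
#   then echo "usable as a CA"
#   else echo "cannot sign certificates"
#   fi
```

Running this against a candidate certificate before passing it to the playbooks would catch the CA:FALSE case found in the sosreport.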
To generate a new OpenShift CA with openshift-ansible, we can ensure that openshift_master_ca_certificate is unset in the inventory and then run the redeploy-openshift-ca.yml playbook (/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/redeploy-openshift-ca.yml).

The existing cluster certificates will not allow services to restart within the redeploy-openshift-ca.yml playbook, so we should skip those steps by commenting out the service restart tasks. These two service restart blocks must be entirely commented out in the linked playbook on disk at /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/redeploy-certificates/openshift-ca.yml.

Master service restart:
https://github.com/openshift/openshift-ansible/blob/release-3.6/playbooks/common/openshift-cluster/redeploy-certificates/openshift-ca.yml#L211-L227

Node service restart:
https://github.com/openshift/openshift-ansible/blob/release-3.6/playbooks/common/openshift-cluster/redeploy-certificates/openshift-ca.yml#L276-L301

Once the CA has been re-generated and distributed, we can generate new cluster certificates by running the redeploy-certificates.yml playbook (/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/redeploy-certificates.yml). Services should now be able to restart, as this playbook restarts services after all certificates have been replaced.

Also note that the redeploy-certificates.yml playbook creates tar archives of /etc/origin/master within /etc/origin for each master before creating new certificates. The first master host will also archive generated certificates (/etc/origin/generated-configs), which is where certificates are created before being distributed to masters other than the first master as well as all nodes.
These archives can be found at /etc/origin/master-node-cert-config-backup-{{ ansible_date_time.epoch }}.tgz, and the oldest archive should contain the original CA certificate as well as all of the original certificates for the cluster, within folders named after the hosts. Note that these archives will only exist if the redeploy-certificates.yml playbook has been run. The redeploy-openshift-ca.yml playbook simply copies all previous CA artifacts to the /etc/origin/master/legacy-ca directory so that all previous CA certificates may be included in the CA bundle.
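Since the archive names end in an epoch timestamp, the oldest archive can be picked out by sorting on that suffix. The following is a sketch under that naming assumption; the function name `oldest_cert_backup` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the backup archive in DIR with the smallest
# epoch suffix, i.e. the oldest one. Prints nothing if none exist.
oldest_cert_backup() {
    dir=$1
    for f in "$dir"/master-node-cert-config-backup-*.tgz; do
        [ -e "$f" ] || continue
        # Strip everything up to "-backup-" and the ".tgz" extension,
        # leaving only the epoch value.
        epoch=${f##*-backup-}
        epoch=${epoch%.tgz}
        printf '%s %s\n' "$epoch" "$f"
    done | sort -n | head -n 1 | cut -d' ' -f2-
}

# Intended use on a master (not executed here):
#   tar -tzf "$(oldest_cert_backup /etc/origin)" | grep 'ca.crt'
```

Listing the archive contents first, rather than extracting it, avoids overwriting anything under /etc/origin while looking for the original CA.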
(In reply to Andrew Butcher from comment #8)
> Additionally, the current CA certificate is issued by an intermediate
> Digicert CA which means that our bundle must contain the intermediate and
> root certificates for the CA we use in order to validate child certificates.
> openshift-ansible does not currently include a mechanism for specifying
> additional CA certificates to include in the bundle so using a custom CA
> certificate with an extended chain may have to be accomplished manually in
> part. We are working to verify steps for using a custom CA certificate with
> an extended chain with openshift-ansible.

In order to use an intermediate CA certificate with openshift-ansible, the CA certificate supplied as openshift_master_ca_certificate must contain the full chain.

To test this, I created an intermediate CA using Jamie Nguyen's guide [1] but created the keys without passphrases. I combined the intermediate CA certificate and the root CA certificate into a single file, beginning with the intermediate CA certificate, and used the full chain as my openshift_master_ca_certificate. For example:

$ cat intermediate/certs/intermediate.cert.pem \
      certs/ca.cert.pem > intermediate/certs/ca-chain.cert.pem

openshift_master_ca_certificate={'certfile': '/home/abutcher/ca/intermediate/certs/ca-chain.cert.pem', 'keyfile': '/home/abutcher/ca/intermediate/private/intermediate.key.pem'}

Before running openshift-ansible, the CA can be tested with oc by trying to create a certificate. Running the first command here will create a test certificate and key in /tmp/. The second command verifies the testing certificate using the existing CA bundle.
$ oc adm ca create-server-cert \
    --signer-cert=/root/ca-chain.cert.pem \
    --signer-key=/root/intermediate.key.pem \
    --signer-serial=/root/ca.serial.txt \
    --hostnames="testing.example.com" \
    --cert=/tmp/testing.crt \
    --key=/tmp/testing.key
$ openssl verify -CAfile /root/ca-chain.cert.pem /tmp/testing.crt
/tmp/testing.crt: OK

My resultant cluster certificates are signed by my intermediate CA certificate and can be verified with the CA bundle.

$ openssl x509 -in /etc/origin/master/ca.crt -noout -text
Certificate:
...
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=US, ST=North Carolina, O=Flibjib, CN=Flibjib Root
        Validity
            Not Before: Jan 4 20:02:48 2018 GMT
            Not After : Jan 2 20:02:48 2028 GMT
        Subject: C=US, ST=North Carolina, O=Flibjib, CN=Flibjib Intermediate
...

$ openssl x509 -in /etc/origin/master/master.server.crt -noout -text
Certificate:
...
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=US, ST=North Carolina, O=Flibjib, CN=Flibjib Intermediate
        Validity
            Not Before: Jan 4 20:17:52 2018 GMT
            Not After : Jan 4 20:17:53 2020 GMT
...

$ openssl verify -CAfile /etc/origin/master/ca-bundle.crt /etc/origin/master/master.server.crt
/etc/origin/master/master.server.crt: OK

[1] https://jamielinux.com/docs/openssl-certificate-authority/index.html
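The chain-building step from this comment (intermediate certificate first, then the root) could be wrapped with a quick sanity check on the resulting file. This is a sketch only; `build_ca_chain` and `count_certs` are hypothetical helper names, and the check counts PEM blocks rather than validating the certificates themselves.

```shell
#!/bin/sh
# Hypothetical helper: write a chain file with the intermediate CA first,
# followed by the root CA.
# Usage: build_ca_chain INTERMEDIATE_PEM ROOT_PEM OUTPUT_PEM
build_ca_chain() {
    cat "$1" "$2" > "$3"
}

# Hypothetical helper: count the PEM certificate blocks in a bundle.
count_certs() {
    grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}

# Intended use with the paths from the example above (not executed here):
#   build_ca_chain intermediate/certs/intermediate.cert.pem certs/ca.cert.pem \
#       intermediate/certs/ca-chain.cert.pem
#   count_certs intermediate/certs/ca-chain.cert.pem   # expect 2 for a two-level chain
```

A count lower than the depth of the chain suggests that a certificate was left out of the file supplied as openshift_master_ca_certificate.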
This was the result of using an invalid certificate authority to re-sign the cluster. Closing notabug.