Bug 1652746
Summary: Running redeploy CA and openshift redeploy certificates causes all nodes to become NotReady.
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-ansible
Reporter: Ryan Howe <rhowe>
Assignee: Joseph Callen <jcallen>
QA Contact: Gaoyun Pei <gpei>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: benhamid, bleanhar, gpei, jdesousa, jjerezro, mirollin, pdwyer, redhat, syangsao, tkimura
Version: 3.11.0
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: When a new CA was generated, the certificates on the nodes were not updated.
Consequence: Nodes became NotReady.
Fix: The redeploy-certificates playbook now includes the existing playbooks that bootstrap, copy certificates, and join nodes.
Result: Nodes no longer go NotReady when the CA is replaced.
Story Points: ---
Last Closed: 2019-06-26 09:07:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 1699467
Description
Ryan Howe
2018-11-22 19:41:13 UTC
Created attachment 1508099 [details]
after-master-redeploy-openshift-ca.txt
Created attachment 1508100 [details]
after-redeploy-cert.txt
To fix this, all that needs to happen is to create a new bootstrap.kubeconfig, distribute it, remove the old certs, restart the nodes, and approve the CSRs.

1. Create a new bootstrap.kubeconfig
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > /etc/origin/node/bootstrap.kubeconfig
2. Distribute it to the nodes, replacing /etc/origin/node/bootstrap.kubeconfig
3. Remove the contents of /etc/origin/node/certificates
# rm -rf /etc/origin/node/certificates
4. Restart the node service
5. Approve the CSRs
# oc get csr -o name | xargs oc adm certificate approve

---------------------------------------------------

Example fix maybe:

- add this play to playbooks/redeploy-certificates.yml
```
- playbooks/openshift-master/private/tasks/enable_bootstrap.yml
- NAME_OF_NEW_PLAY
```
- then a new play:
```
# Remove the certificates issued by the old CA so the node re-bootstraps.
- name: Remove old node certificates
  hosts: oo_nodes_to_config
  gather_facts: no
  tasks:
  - file:
      path: /etc/origin/node/certificates
      state: absent

- name: Distribute bootstrap and start nodes
  hosts: oo_nodes_to_config
  gather_facts: no
  tasks:
  - import_role:
      name: openshift_node
      tasks_from: distribute_bootstrap.yml

- name: Approve any pending CSR requests from inventory nodes
  hosts: oo_first_master
  gather_facts: no
  tasks:
  - name: Dump all candidate bootstrap hostnames
    debug:
      msg: "{{ groups['oo_nodes_to_config'] | default([]) }}"

  - name: Find all hostnames for bootstrapping
    set_fact:
      l_nodes_to_join: "{{ groups['oo_nodes_to_config'] | default([]) | map('extract', hostvars) | map(attribute='l_kubelet_node_name') | list }}"

  - name: Dump the bootstrap hostnames
    debug:
      msg: "{{ l_nodes_to_join }}"

  # Retry until every node in the list has had its CSRs approved.
  - name: Approve node certificates when bootstrapping
    oc_csr_approve:
      oc_bin: "{{ openshift_client_binary }}"
      oc_conf: "{{ openshift.common.config_base }}/master/admin.kubeconfig"
      node_list: "{{ l_nodes_to_join }}"
    register: node_bootstrap_csr_approve
    retries: 30
    until: node_bootstrap_csr_approve is succeeded
    when:
    - l_nodes_to_join|length > 0
```

UPDATED MANUAL STEPS to replace certs on NODE.

1. Create a new bootstrap.kubeconfig for the nodes (MASTER nodes will just copy admin.kubeconfig)
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > bootstrap.kubeconfig
1B. JUST ON THE MASTERS
# cp /etc/origin/master/admin.kubeconfig /etc/origin/node/bootstrap.kubeconfig
2. Distribute the config created in step 1 to the infra and compute nodes, replacing /etc/origin/node/bootstrap.kubeconfig
3A. Remove the contents of /etc/origin/node/certificates and move the node.kubeconfig and client-ca.crt
# rm -rf /etc/origin/node/certificates
# mv /etc/origin/node/client-ca.crt{,.old}
# mv /etc/origin/node/node.kubeconfig{,.old}
4. Restart the node service.
# systemctl restart atomic-openshift-node.service
5. Approve the CSRs. 2 should be approved.
# oc get csr -o name | xargs oc adm certificate approve

I believe there is a step missing in the workaround in comment #4:

6. Copy the bootstrap.kubeconfig created in step 1 to the following location on the master nodes: /etc/origin/master/bootstrap.kubeconfig

We got hit by this bug and applied the workaround successfully, then tried to add new nodes to the cluster and they wouldn't join, because the bootstrap.kubeconfig file that was being distributed to the new nodes by the ansible playbooks was still the old one.

Build: openshift-ansible-3.11.106-1

@Ryan, thanks for those steps, appreciated. That helped us get back on track.

@Jose - So did we, with the same with being on the masters too. Also, there are a number of secrets that need to be regenerated: all the "default-tokenxxxx" secrets in each project, and the majority of secrets of type kubernetes.io/tls, especially in the openshift-monitoring project. Then bounce the impacted pods to pick up the new secrets that are regenerated by the controller.

Verified this bug with openshift-ansible-3.11.117-1.git.0.add13ff.el7.noarch.rpm

1. Run playbooks/openshift-master/redeploy-openshift-ca.yml to redeploy the OpenShift CA cert
2. With openshift_redeploy_openshift_ca=true added in the inventory file, run playbooks/redeploy-certificates.yml to redeploy the OpenShift certs

TASK [Remove generated certificates] *******************************************
changed: [ec2-52-90-122-145.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-3-81-234-184.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-3-90-244-172.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-52-90-122-145.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}
changed: [ec2-3-81-234-184.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}
changed: [ec2-3-90-244-172.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}

After redeployment, the node certs and the node bootstrap.kubeconfig get recreated. The node certs were issued by the new CA, and bootstrap.kubeconfig was updated with the new CA. All nodes are in Ready status, all pods are running, and there are no pending CSRs. Moving this bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605

*** Bug 1734552 has been marked as a duplicate of this bug. ***

I ran into this issue and managed to get my cluster into a ready state by applying the commits of these two PRs before running the redeploy certificates playbook:
- https://github.com/openshift/openshift-ansible/pull/12235
- https://github.com/openshift/openshift-ansible/pull/12235/commits/6864ffe08de7c1d6a4caeeefe6182fa4d67817a0
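
For anyone still on a build without the fix, the node-side part of the updated manual steps earlier in this report (3A, 4 and 5) could be wrapped in a small dry-run script along these lines. This is only a sketch: the function names, the `run` helper and the `DRY_RUN` switch are hypothetical and not part of openshift-ansible. It assumes the new bootstrap.kubeconfig has already been copied to /etc/origin/node/ on the node, and that `approve_csrs` is run on a master with cluster-admin credentials.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the manual node workaround from this bug report.
# Hypothetical helper names; commands are taken from the steps above.
set -euo pipefail

NODE_DIR=/etc/origin/node
DRY_RUN="${DRY_RUN:-1}"   # assumption: 1 = only print, 0 = really execute

# Print the command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Steps 3A and 4: run on each node once the new bootstrap.kubeconfig is in place.
reset_node_certs() {
  run rm -rf "$NODE_DIR/certificates"
  run mv "$NODE_DIR/client-ca.crt" "$NODE_DIR/client-ca.crt.old"
  run mv "$NODE_DIR/node.kubeconfig" "$NODE_DIR/node.kubeconfig.old"
  run systemctl restart atomic-openshift-node.service
}

# Step 5: run on a master after the nodes have restarted.
approve_csrs() {
  run sh -c 'oc get csr -o name | xargs oc adm certificate approve'
}
```

With the default `DRY_RUN=1`, sourcing the file and calling `reset_node_certs` only prints the four commands so they can be reviewed first; setting `DRY_RUN=0` would execute them. Steps 1, 1B, 2 and 6 (creating and distributing bootstrap.kubeconfig) are left out on purpose, since how files get copied between hosts depends on the environment.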