Created attachment 1508098 [details]
beforerun.txt

Description of problem:
After redeploying a new CA and running redeploy-certificates, all nodes become NotReady because the node certificates are not rotated and bootstrap.kubeconfig is not updated.

Version-Release number of the following components:
openshift-ansible-3.10.66
openshift-ansible-3.11.16

How reproducible:

Steps to Reproduce:
1. ansible-playbook -i <inventory_file> /usr/share/ansible/openshift-ansible/playbooks/openshift-master/redeploy-openshift-ca.yml
2. ansible-playbook -i <inventory_file> /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml

Actual results:
ca.crt gets updated and distributed to the cluster.
Master certs are updated.
Node certs are not recreated.
Node bootstrap.kubeconfig is not updated with the new CA and client certs.

Expected results:
Node certs are recreated.
Node bootstrap.kubeconfig is updated with the new CA and client certs.

Additional info:
List certs
# bash <(curl -s https://gist.githubusercontent.com/rjhowe/686e3729c2176b446f55dd6d6f1cdc22/raw/c24a774d7acd5cfce2895dd5fa246947b4c1b039/certcheck.sh)

Before running any playbooks: beforerun.txt
After CA redeploy playbook: after-master-redeploy-openshift-ca.txt
After redeploy certs playbook: after-redeploy-cert.txt
Created attachment 1508099 [details] after-master-redeploy-openshift-ca.txt
Created attachment 1508100 [details] after-redeploy-cert.txt
To fix this, all that needs to happen is to create a new bootstrap.kubeconfig, distribute it, remove the old certs, restart the nodes, and approve the CSRs.

1. Create a new bootstrap.kubeconfig
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > /etc/origin/node/bootstrap.kubeconfig

2. Distribute it to the nodes, replacing /etc/origin/node/bootstrap.kubeconfig

3. Remove the contents of /etc/origin/node/certificates
# rm -rf /etc/origin/node/certificates

4. Restart the node service

5. Approve CSRs
# oc get csr -o name | xargs oc adm certificate approve

---------------------------------------------------

Possible fix:

- Add these plays to playbooks/redeploy-certificates.yml
```
- playbooks/openshift-master/private/tasks/enable_bootstrap.yml
- NAME_OF_NEW_PLAY
```

- Then add the new plays:
```
- name: Remove old node certificates
  hosts: oo_nodes_to_config
  gather_facts: no
  tasks:
  - file:
      path: /etc/origin/node/certificates
      state: absent

- name: Distribute bootstrap and start nodes
  hosts: oo_nodes_to_config
  gather_facts: no
  tasks:
  - import_role:
      name: openshift_node
      tasks_from: distribute_bootstrap.yml

- name: Approve any pending CSR requests from inventory nodes
  hosts: oo_first_master
  gather_facts: no
  tasks:
  - name: Dump all candidate bootstrap hostnames
    debug:
      msg: "{{ groups['oo_nodes_to_config'] | default([]) }}"

  - name: Find all hostnames for bootstrapping
    set_fact:
      l_nodes_to_join: "{{ groups['oo_nodes_to_config'] | default([]) | map('extract', hostvars) | map(attribute='l_kubelet_node_name') | list }}"

  - name: Dump the bootstrap hostnames
    debug:
      msg: "{{ l_nodes_to_join }}"

  - name: Approve node certificates when bootstrapping
    oc_csr_approve:
      oc_bin: "{{ openshift_client_binary }}"
      oc_conf: "{{ openshift.common.config_base }}/master/admin.kubeconfig"
      node_list: "{{ l_nodes_to_join }}"
    register: node_bootstrap_csr_approve
    retries: 30
    until: node_bootstrap_csr_approve is succeeded
    when:
    - l_nodes_to_join|length > 0
```
UPDATED MANUAL STEPS to replace certs on a NODE.

1A. Create a new bootstrap.kubeconfig for the nodes (MASTER nodes will just copy admin.kubeconfig)
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > bootstrap.kubeconfig

1B. JUST ON THE MASTERS
# cp /etc/origin/master/admin.kubeconfig /etc/origin/node/bootstrap.kubeconfig

2. Distribute the config created in step 1A to the infra and compute nodes, replacing /etc/origin/node/bootstrap.kubeconfig

3A. Remove the contents of /etc/origin/node/certificates and move the node.kubeconfig and client-ca.crt out of the way
# rm -rf /etc/origin/node/certificates
# mv /etc/origin/node/client-ca.crt{,.old}
# mv /etc/origin/node/node.kubeconfig{,.old}

4. Restart the node service.
# systemctl restart atomic-openshift-node.service

5. Approve CSRs. Two should be approved per node.
# oc get csr -o name | xargs oc adm certificate approve
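For anyone repeating the per-node part of these steps on many hosts, here is a minimal sketch of steps 3A and 4 wrapped in a function. It only uses the commands already listed above; the `run` helper, `refresh_node_certs`, and the `DRY_RUN` and `NODE_DIR` variables are hypothetical names introduced for this example, not part of openshift-ansible.

```shell
# Sketch only: steps 3A and 4 from the manual procedure, for a single node.
# NODE_DIR and DRY_RUN are hypothetical knobs added for this example.
NODE_DIR="${NODE_DIR:-/etc/origin/node}"

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

refresh_node_certs() {
  # Step 3A: drop the old certificates, then set aside the stale
  # client CA and node kubeconfig so the node bootstraps again.
  run rm -rf "${NODE_DIR}/certificates"
  for f in client-ca.crt node.kubeconfig; do
    # In dry-run mode the mv is always shown; otherwise skip missing files.
    if [ "${DRY_RUN:-0}" = "1" ] || [ -e "${NODE_DIR}/${f}" ]; then
      run mv "${NODE_DIR}/${f}" "${NODE_DIR}/${f}.old"
    fi
  done
  # Step 4: restart the node service.
  run systemctl restart atomic-openshift-node.service
}
```

Running `DRY_RUN=1 refresh_node_certs` prints the four commands without touching the node, which is a cheap way to check the paths before running it for real. The new bootstrap.kubeconfig from step 1A/1B still has to be in place first, and the CSRs still have to be approved from a master afterwards.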
I believe there is a step missing in the workaround in comment #4:

6. Copy the bootstrap.kubeconfig created in step 1 to the following location on the master nodes: /etc/origin/master/bootstrap.kubeconfig

We got hit by this bug and applied the workaround successfully, then tried to add new nodes to the cluster and they wouldn't join, because the bootstrap.kubeconfig file that the ansible playbooks distributed to the new nodes was still the old one.
Build: openshift-ansible-3.11.106-1
@Ryan, thanks for those steps, appreciated. That helped us get back on track.

@Jose - So did we, and the same applied on the masters too.

Also, there are a number of secrets that need to be regenerated: all the "default-tokenxxxx" secrets in each project, and the majority of secrets of type kubernetes.io/tls, especially in the openshift-monitoring project. Then bounce the impacted pods so they pick up the new secrets that the controller regenerates.
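To get a list of the secrets described above, something like the following sketch may help. It is a filter over the text output of `oc get secrets --all-namespaces --no-headers`, assuming the usual column order NAMESPACE NAME TYPE DATA AGE; `secrets_to_review` is a hypothetical helper name. It only lists candidates and deletes nothing.

```shell
# Sketch only: read `oc get secrets --all-namespaces --no-headers` on stdin
# (assumed columns: NAMESPACE NAME TYPE DATA AGE) and print "namespace name"
# for default service-account tokens and for TLS secrets.
secrets_to_review() {
  awk '
    $3 == "kubernetes.io/tls" { print $1, $2; next }
    $3 == "kubernetes.io/service-account-token" && $2 ~ /^default-token-/ { print $1, $2 }
  '
}
```

Usage would be `oc get secrets --all-namespaces --no-headers | secrets_to_review`. Treat the output as a list to review rather than to delete blindly: a TLS secret that was created by hand will not be regenerated by any controller.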
Verified this bug with openshift-ansible-3.11.117-1.git.0.add13ff.el7.noarch.rpm

1. Run playbooks/openshift-master/redeploy-openshift-ca.yml to redeploy the OpenShift CA cert
2. With openshift_redeploy_openshift_ca=true added to the inventory file, run playbooks/redeploy-certificates.yml to redeploy the OpenShift certs

TASK [Remove generated certificates] *******************************************
changed: [ec2-52-90-122-145.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-3-81-234-184.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-3-90-244-172.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-52-90-122-145.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}
changed: [ec2-3-81-234-184.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}
changed: [ec2-3-90-244-172.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}

After redeployment, the node certs and the node bootstrap.kubeconfig are recreated. The node certs were issued by the new CA, and bootstrap.kubeconfig is updated with the new CA. All nodes are in Ready status, all pods are running, and there are no pending CSRs. Moving this bug to VERIFIED.
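The final checks in this verification (all nodes Ready, no pending CSRs) can be scripted roughly as below. This is a sketch that filters the plain-text output of `oc get nodes --no-headers` and `oc get csr --no-headers`, assuming STATUS is the second column of the node listing and CONDITION is the last column of the CSR listing; `not_ready_nodes` and `pending_csrs` are hypothetical helper names.

```shell
# Sketch only: read `oc get nodes --no-headers` on stdin and print the names
# of nodes whose STATUS (assumed second column) is not Ready.
# "Ready,SchedulingDisabled" still counts as Ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready(,|$)/ { print $1 }'
}

# Sketch only: read `oc get csr --no-headers` on stdin and print the names
# of CSRs whose CONDITION (assumed last column) is Pending.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}
```

Usage would be `oc get nodes --no-headers | not_ready_nodes` and `oc get csr --no-headers | pending_csrs`; both printing nothing corresponds to the state described above.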
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1605
*** Bug 1734552 has been marked as a duplicate of this bug. ***
I ran into this issue and managed to get my cluster into a Ready state by applying the commits from these two links before running the redeploy certificates playbook:
- https://github.com/openshift/openshift-ansible/pull/12235
- https://github.com/openshift/openshift-ansible/pull/12235/commits/6864ffe08de7c1d6a4caeeefe6182fa4d67817a0