Bug 1652746 - Running redeploy CA and openshift redeploy certificates causes all nodes to become NotReady.
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Joseph Callen
QA Contact: Gaoyun Pei
Duplicates: 1734552
Depends On:
Blocks: 1699467
Reported: 2018-11-22 19:41 UTC by Ryan Howe
Modified: 2022-03-13 16:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When a new CA was generated, the certificates on the nodes were not updated.
Consequence: Nodes would become NotReady.
Fix: The redeploy-certificates playbook now includes the existing playbooks that bootstrap nodes, copy certificates, and join nodes.
Result: Nodes no longer go NotReady when the CA is replaced.
Clone Of:
Clones: 1699467
Last Closed: 2019-06-26 09:07:54 UTC
Target Upstream Version:

Attachments
beforerun.txt (1.89 KB, text/plain), 2018-11-22 19:41 UTC, Ryan Howe
after-master-redeploy-openshift-ca.txt (1.89 KB, text/plain), 2018-11-22 19:42 UTC, Ryan Howe
after-redeploy-cert.txt (1.89 KB, text/plain), 2018-11-22 19:42 UTC, Ryan Howe

System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 3821042 | 0 | Troubleshoot | None | How can I manually replace 3.10 or 3.11 node certificates | 2019-03-06 18:53:44 UTC
Red Hat Product Errata | RHBA-2019:1605 | 0 | None | None | None | 2019-06-26 09:08:02 UTC

Internal Links: 1757695

Description Ryan Howe 2018-11-22 19:41:13 UTC
Created attachment 1508098 [details]

Description of problem:

After redeploying a new CA and running redeploy-certificates, all nodes become NotReady because the node certificates are not rotated and bootstrap.kubeconfig is not updated.

Version-Release number of the following components:

How reproducible:

Steps to Reproduce:
1. ansible-playbook -i <inventory_file> /usr/share/ansible/openshift-ansible/playbooks/openshift-master/redeploy-openshift-ca.yml

2.  ansible-playbook -i <inventory_file> /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml

Actual results:
  ca.crt gets updated and distributed to cluster. 
  master certs are updated. 

  node certs are not recreated 
  node bootstrap.kubeconfig is not updated with the new CA and client certs.

Expected results:
 node certs to be recreated 
 node bootstrap.kubeconfig to get updated with new CA and client certs

Additional info:

List certs  
# bash <(curl -s https://gist.githubusercontent.com/rjhowe/686e3729c2176b446f55dd6d6f1cdc22/raw/c24a774d7acd5cfce2895dd5fa246947b4c1b039/certcheck.sh)

Before running any playbooks: 

After CA redeploy playbook: 

After redeploy certs playbook 

Comment 1 Ryan Howe 2018-11-22 19:42:14 UTC
Created attachment 1508099 [details]

Comment 2 Ryan Howe 2018-11-22 19:42:44 UTC
Created attachment 1508100 [details]

Comment 3 Ryan Howe 2018-11-23 22:38:26 UTC
To fix this, all that needs to happen is to create a new bootstrap.kubeconfig, distribute it, remove the old certs, restart the nodes, and approve the CSRs.

1. Create a new bootstrap.kubeconfig 
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > /etc/origin/node/bootstrap.kubeconfig

2. Distribute it to nodes replacing /etc/origin/node/bootstrap.kubeconfig

3. Remove contents of /etc/origin/node/certificates
# rm -rf  /etc/origin/node/certificates

4. Restart node service 

5. Approve CSRs
# oc get csr -o name | xargs oc adm certificate approve

Example fix maybe: 

- add this play to playbooks/redeploy-certificates.yml
 - playbooks/openshift-master/private/tasks/enable_bootstrap.yml

- then a new play:

- name: Remove old node certificates
  hosts: oo_nodes_to_config
  gather_facts: no
  tasks:
  - file:
      path: /etc/origin/node/certificates
      state: absent

- name: Distribute bootstrap and start nodes
  hosts: oo_nodes_to_config
  gather_facts: no
  tasks:
  - import_role:
      name: openshift_node
      tasks_from: distribute_bootstrap.yml

- name: Approve any pending CSR requests from inventory nodes
  hosts: oo_first_master
  gather_facts: no
  tasks:
  - name: Dump all candidate bootstrap hostnames
    debug:
      msg: "{{ groups['oo_nodes_to_config'] | default([]) }}"

  - name: Find all hostnames for bootstrapping
    set_fact:
      l_nodes_to_join: "{{ groups['oo_nodes_to_config'] | default([]) | map('extract', hostvars) | map(attribute='l_kubelet_node_name') | list }}"

  - name: Dump the bootstrap hostnames
    debug:
      msg: "{{ l_nodes_to_join }}"

  - name: Approve node certificates when bootstrapping
    # The module name was missing from the pasted snippet; oc_csr_approve
    # (from openshift-ansible's lib_openshift role) is assumed here.
    oc_csr_approve:
      oc_bin: "{{ openshift_client_binary }}"
      oc_conf: "{{ openshift.common.config_base }}/master/admin.kubeconfig"
      node_list: "{{ l_nodes_to_join }}"
    register: node_bootstrap_csr_approve
    retries: 30
    until: node_bootstrap_csr_approve is succeeded
    when:
    - l_nodes_to_join|length > 0

Comment 4 Ryan Howe 2018-12-28 14:49:33 UTC
UPDATED MANUAL STEPS to replace certs on NODE. 

1. Create a new bootstrap.kubeconfig for nodes (MASTER nodes will just copy admin.kubeconfig)
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > bootstrap.kubeconfig

# cp /etc/origin/master/admin.kubeconfig /etc/origin/node/bootstrap.kubeconfig

2. Distribute the bootstrap.kubeconfig created by the first command in step 1 to the infra and compute nodes, replacing /etc/origin/node/bootstrap.kubeconfig

3. Remove /etc/origin/node/certificates and move aside node.kubeconfig and client-ca.crt
# rm -rf  /etc/origin/node/certificates
# mv /etc/origin/node/client-ca.crt{,.old}
# mv /etc/origin/node/node.kubeconfig{,.old}

4. Restart node service.
# systemctl restart atomic-openshift-node.service 

5. Approve CSRs. 2 should be approved.  
# oc get csr -o name | xargs oc adm certificate approve
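
The node-side part of these steps (removing the old certs, moving the old files aside, and restarting the node service) could be scripted roughly as below. This is only a sketch: the DRY_RUN switch and the run / reset_node_certs helper names are made up for illustration and are not part of openshift-ansible. The paths and the service name are the ones from the steps above. With DRY_RUN=1 (the default here) it only prints what it would do.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN, run and reset_node_certs are hypothetical helper names.
NODE_DIR="${NODE_DIR:-/etc/origin/node}"   # path taken from the steps above
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Remove the old node certs, move aside the old CA bundle and kubeconfig,
# then restart the node service. Stops at the first command that fails.
reset_node_certs() {
  run rm -rf "$NODE_DIR/certificates" &&
  run mv "$NODE_DIR/client-ca.crt" "$NODE_DIR/client-ca.crt.old" &&
  run mv "$NODE_DIR/node.kubeconfig" "$NODE_DIR/node.kubeconfig.old" &&
  run systemctl restart atomic-openshift-node.service
}
```

Calling reset_node_certs on a node previews the commands; DRY_RUN=0 reset_node_certs (as root, after the new bootstrap.kubeconfig is in place) would actually run them. The CSRs still have to be approved from a master as in step 5.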

Comment 11 Jose Ignacio Jerez 2019-04-15 08:24:49 UTC
I believe there is a step missing in the workaround in comment #4:

6. Copy the bootstrap.kubeconfig created in step 1 to the following location on the master nodes: /etc/origin/master/bootstrap.kubeconfig

We were hit by this bug and applied the workaround successfully, but when we then tried to add new nodes to the cluster they wouldn't join, because the bootstrap.kubeconfig file that the Ansible playbooks distributed to the new nodes was still the old one.

Comment 12 Joseph Callen 2019-04-15 13:05:56 UTC
Build: openshift-ansible-3.11.106-1

Comment 13 cg 2019-04-16 22:36:20 UTC
@Ryan, thanks for those steps, appreciated. That helped us get back on track.

@Jose - So did we, and we hit the same issue on the masters too.

Also, a number of secrets need to be regenerated: all the "default-token-xxxx" secrets in each project and the majority of secrets of type kubernetes.io/tls, especially in the openshift-monitoring project.
Then bounce the impacted pods so they pick up the new secrets that the controller regenerates.
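
To get a list of candidates to look at, something like the filter below might help. It is a sketch under assumptions: stale_secret_candidates is a made-up helper name, it expects the default column layout of "oc get secrets --all-namespaces --no-headers" (NAMESPACE NAME TYPE DATA AGE), and it only lists secrets; deciding which ones really need to be deleted and regenerated is left to the admin.

```shell
#!/usr/bin/env bash
# Sketch only: stale_secret_candidates is a hypothetical helper name.
# Reads `oc get secrets --all-namespaces --no-headers` output on stdin and
# prints namespace/name for TLS secrets and default-token-* service account
# token secrets. It lists candidates only and deletes nothing.
stale_secret_candidates() {
  awk '
    $3 == "kubernetes.io/tls" ||
    ($3 == "kubernetes.io/service-account-token" && $2 ~ /^default-token-/) {
      print $1 "/" $2
    }
  '
}

# Usage (on a master, logged in as a cluster admin):
#   oc get secrets --all-namespaces --no-headers | stale_secret_candidates
```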

Comment 15 Gaoyun Pei 2019-06-13 07:43:17 UTC
Verified this bug with openshift-ansible-3.11.117-1.git.0.add13ff.el7.noarch.rpm

1. Run playbooks/openshift-master/redeploy-openshift-ca.yml to redeploy OpenShift CA cert
2. With openshift_redeploy_openshift_ca=true added in inventory file, run playbooks/redeploy-certificates.yml to redeploy the OpenShift certs

TASK [Remove generated certificates] *******************************************
changed: [ec2-52-90-122-145.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-3-81-234-184.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-3-90-244-172.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-52-90-122-145.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}
changed: [ec2-3-81-234-184.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}
changed: [ec2-3-90-244-172.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}

After redeployment, the node certs and the node bootstrap.kubeconfig were recreated.
The node certs were issued by the new CA, and bootstrap.kubeconfig was updated with the new CA.

All nodes are in Ready status, all pods are running, and there are no pending CSRs. Moving this bug to VERIFIED.
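
For anyone repeating this check, the node and CSR part can be done with two small filters such as the ones below. This is a sketch, not what was used for the verification above: not_ready_nodes and pending_csrs are made-up helper names, and they assume the default column layout of "oc get nodes --no-headers" (STATUS in the second column) and "oc get csr --no-headers" (CONDITION in the last column).

```shell
#!/usr/bin/env bash
# Sketch only: not_ready_nodes and pending_csrs are hypothetical helper names.

# Reads `oc get nodes --no-headers` on stdin and prints nodes whose status
# does not start with Ready (so "Ready,SchedulingDisabled" counts as Ready).
not_ready_nodes() {
  awk '{ split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Reads `oc get csr --no-headers` on stdin and prints CSRs still Pending.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Usage (on a master); both commands should print nothing on a healthy cluster:
#   oc get nodes --no-headers | not_ready_nodes
#   oc get csr --no-headers | pending_csrs
```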

Comment 17 errata-xmlrpc 2019-06-26 09:07:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 18 Ryan Howe 2019-08-02 15:57:44 UTC
*** Bug 1734552 has been marked as a duplicate of this bug. ***

Comment 19 Omar BENHAMID 2020-11-17 17:30:44 UTC
I ran into this issue and managed to get my cluster into a ready state by applying the commits from these two PRs before running the redeploy certificates playbook:

- https://github.com/openshift/openshift-ansible/pull/12235
- https://github.com/openshift/openshift-ansible/pull/12235/commits/6864ffe08de7c1d6a4caeeefe6182fa4d67817a0
