1652746 – Running redeploy CA and openshift redeploy certificates causes all nodes to become NotReady.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Joseph Callen
QA Contact:	Gaoyun Pei
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1734552 (view as bug list)
Depends On:
Blocks:	1699467

Reported:	2018-11-22 19:41 UTC by Ryan Howe
Modified:	2023-03-24 14:23 UTC (History)
CC List:	10 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When a new CA was generated the certificates on the nodes were not updated. Consequence: Nodes would become not ready Fix: Add to the redeploy-certificates playbook existing playbooks that will bootstrap, copy certificates and join nodes. Result: Nodes no longer go NotReady when replacing CA.
Clone Of:
Clones:	1699467 (view as bug list)
Environment:
Last Closed:	2019-06-26 09:07:54 UTC
Target Upstream Version:
Embargoed:

Attachments
beforerun.txt (1.89 KB, text/plain) 2018-11-22 19:41 UTC, Ryan Howe	no flags	Details
after-master-redeploy-openshift-ca.txt (1.89 KB, text/plain) 2018-11-22 19:42 UTC, Ryan Howe	no flags	Details
after-redeploy-cert.txt (1.89 KB, text/plain) 2018-11-22 19:42 UTC, Ryan Howe	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3821042	0	Troubleshoot	None	How can I manually replace 3.10 or 3.11 node certificates	2019-03-06 18:53:44 UTC
Red Hat Product Errata	RHBA-2019:1605	0	None	None	None	2019-06-26 09:08:02 UTC

Internal Links: 1757695

Description Ryan Howe 2018-11-22 19:41:13 UTC

Created attachment 1508098 [details]
beforerun.txt

Description of problem:

After redeploying a new CA and running redeploy-certificates, all nodes become NotReady because the node certificates are not rotated and bootstrap.kubeconfig is not updated.


Version-Release number of the following components:
 openshift-ansible-3.10.66 
 openshift-ansible-3.11.16
 

How reproducible:

Steps to Reproduce:
1. ansible-playbook -i <inventory_file> /usr/share/ansible/openshift-ansible/playbooks/openshift-master/redeploy-openshift-ca.yml

2.  ansible-playbook -i <inventory_file> /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml


Actual results:
  ca.crt gets updated and distributed to cluster. 
  master certs are updated. 

  node certs are not recreated
  node bootstrap.kubeconfig is not updated with the new CA and client certs.

Expected results:
 node certs to be recreated 
 node bootstrap.kubeconfig to get updated with new CA and client certs

Additional info:

List certs  
# bash <(curl -s https://gist.githubusercontent.com/rjhowe/686e3729c2176b446f55dd6d6f1cdc22/raw/c24a774d7acd5cfce2895dd5fa246947b4c1b039/certcheck.sh)

Before running any playbooks: 
 beforerun.txt 

After CA redeploy playbook: 
 after-master-redeploy-openshift-ca.txt

After redeploy certs playbook 
 after-redeploy-cert.txt
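
A quick way to confirm the symptom on a single node is to check whether its client certificate still verifies against the current CA. The following is a minimal sketch, not part of the original report; the function name `cert_matches_ca` is hypothetical and the file paths in the usage note are assumptions about a default 3.10/3.11 layout.

```shell
#!/usr/bin/env bash
# cert_matches_ca CA_FILE CERT_FILE
# Returns 0 when CERT_FILE verifies against CA_FILE, non-zero otherwise.
# The textual result is parsed instead of the exit code because older
# openssl builds do not report every verification error through it.
cert_matches_ca() {
  local ca_file="$1" cert_file="$2" out
  out="$(openssl verify -CAfile "$ca_file" "$cert_file" 2>/dev/null || true)"
  [[ "$out" == *": OK" ]]
}
```

For example, `cert_matches_ca /etc/origin/master/ca.crt /etc/origin/node/certificates/kubelet-client-current.pem` (both paths assumed) would be expected to fail on an affected master. If the CA file is a bundle that still contains the old CA, certificates issued by the old CA will also verify, so the check is only meaningful against the new CA alone.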

Comment 1 Ryan Howe 2018-11-22 19:42:14 UTC

Created attachment 1508099 [details]
after-master-redeploy-openshift-ca.txt

Comment 2 Ryan Howe 2018-11-22 19:42:44 UTC

Created attachment 1508100 [details]
after-redeploy-cert.txt

Comment 3 Ryan Howe 2018-11-23 22:38:26 UTC

To fix this, all that needs to happen is to create a new bootstrap.kubeconfig, distribute it, remove the old certs, restart the nodes, and approve the CSRs.

1. Create a new bootstrap.kubeconfig 
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > /etc/origin/node/bootstrap.kubeconfig

2. Distribute it to nodes replacing /etc/origin/node/bootstrap.kubeconfig


3. Remove contents of /etc/origin/node/certificates
# rm -rf  /etc/origin/node/certificates


4. Restart node service 

5. Approve CSRs
# oc get csr -o name | xargs oc adm certificate approve
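
The one-liner in step 5 approves every CSR it finds, including ones that were already handled. A slightly more selective variant, sketched here under the assumption that the last column of `oc get csr --no-headers` is the CONDITION column, approves only requests that are still pending. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# pending_csrs: read `oc get csr --no-headers` output on stdin and print the
# names of requests whose last column (CONDITION) is exactly "Pending".
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# approve_pending_csrs: approve each pending CSR one by one.
# Does nothing when no request is pending.
approve_pending_csrs() {
  local name
  oc get csr --no-headers | pending_csrs | while read -r name; do
    oc adm certificate approve "$name"
  done
}
```

Calling `approve_pending_csrs` more than once may be needed, since a node can submit a further CSR after its first one is approved.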


---------------------------------------------------
---------------------------------------------------
Example fix maybe: 

- add this play to playbooks/redeploy-certificates.yml
```
 - playbooks/openshift-master/private/tasks/enable_bootstrap.yml

 - NAME_OF_NEW_PLAY
 
```
 
- then a new play:

```
- name: Remove old node certificates
  hosts: oo_nodes_to_config
  gather_facts: no
  tasks:
  - name: Remove the node certificates directory
    file:
      path: /etc/origin/node/certificates
      state: absent

- name: Distribute bootstrap and start nodes
  hosts: oo_nodes_to_config
  gather_facts: no
  tasks:
  - import_role:
      name: openshift_node
      tasks_from: distribute_bootstrap.yml

- name: Approve any pending CSR requests from inventory nodes
  hosts: oo_first_master
  gather_facts: no
  tasks:
  - name: Dump all candidate bootstrap hostnames
    debug:
      msg: "{{ groups['oo_nodes_to_config'] | default([]) }}"

  - name: Find all hostnames for bootstrapping
    set_fact:
      l_nodes_to_join: "{{ groups['oo_nodes_to_config'] | default([]) | map('extract', hostvars) | map(attribute='l_kubelet_node_name') | list }}"

  - name: Dump the bootstrap hostnames
    debug:
      msg: "{{ l_nodes_to_join }}"


  - name: Approve node certificates when bootstrapping
    oc_csr_approve:
      oc_bin: "{{ openshift_client_binary }}"
      oc_conf: "{{ openshift.common.config_base }}/master/admin.kubeconfig"
      node_list: "{{ l_nodes_to_join }}"
    register: node_bootstrap_csr_approve
    retries: 30
    until: node_bootstrap_csr_approve is succeeded
    when:
    - l_nodes_to_join|length > 0
```

Comment 4 Ryan Howe 2018-12-28 14:49:33 UTC

UPDATED MANUAL STEPS to replace certs on NODE. 

1. Create a new bootstrap.kubeconfig for nodes (MASTER nodes will just copy admin.kubeconfig)
# oc serviceaccounts create-kubeconfig node-bootstrapper -n openshift-infra > bootstrap.kubeconfig

1B. JUST ON THE MASTERS 
# cp /etc/origin/master/admin.kubeconfig /etc/origin/node/bootstrap.kubeconfig

2. Distribute the kubeconfig created in step 1 to the infra and compute nodes, replacing /etc/origin/node/bootstrap.kubeconfig

3A. Remove contents of /etc/origin/node/certificates and move the node.kubeconfig and client-ca.crt 
# rm -rf  /etc/origin/node/certificates
# mv /etc/origin/node/client-ca.crt{,.old}
# mv /etc/origin/node/node.kubeconfig{,.old}

4. Restart node service.
# systemctl restart atomic-openshift-node.service 

5. Approve CSRs. 2 should be approved.  
# oc get csr -o name | xargs oc adm certificate approve
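
The per-node part of the steps above (1B through 4) can be wrapped in one function. This is only a sketch: the function name and the `NODE_DIR`, `MASTER_DIR` and `RESTART_CMD` overrides are hypothetical additions so that it can be tried against a scratch directory before touching a real node.

```shell
#!/usr/bin/env bash
# refresh_node_certs ROLE [NEW_BOOTSTRAP]
# Run as root on the node itself.
#   ROLE "master" copies admin.kubeconfig (step 1B); any other value copies
#   NEW_BOOTSTRAP, the file produced in step 1 and shipped to the node (step 2).
refresh_node_certs() {
  local role="$1" new_bootstrap="${2:-}"
  local node_dir="${NODE_DIR:-/etc/origin/node}"
  local master_dir="${MASTER_DIR:-/etc/origin/master}"
  local restart_cmd="${RESTART_CMD:-systemctl restart atomic-openshift-node.service}"
  local f

  if [ "$role" = "master" ]; then
    cp "$master_dir/admin.kubeconfig" "$node_dir/bootstrap.kubeconfig" || return 1
  else
    cp "$new_bootstrap" "$node_dir/bootstrap.kubeconfig" || return 1
  fi

  # Step 3A: drop the old certificates and set the old client files aside.
  rm -rf "$node_dir/certificates"
  for f in client-ca.crt node.kubeconfig; do
    if [ -e "$node_dir/$f" ]; then
      mv "$node_dir/$f" "$node_dir/$f.old"
    fi
  done

  # Step 4: restart the node service so that it bootstraps again.
  $restart_cmd
}
```

On a compute node this would be called as `refresh_node_certs compute /root/bootstrap.kubeconfig` (path assumed), followed by the CSR approval in step 5 from a master.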

Comment 11 Jose Ignacio Jerez 2019-04-15 08:24:49 UTC

I believe there is a step missing in the workaround in comment #4:

6. Copy the bootstrap.kubeconfig created in step 1 to the following location in the master nodes: /etc/origin/master/bootstrap.kubeconfig


We got hit by this bug and applied the workaround successfully, then tried to add new nodes to the cluster and they wouldn't join because the bootstrap.kubeconfig file that was being distributed to the new nodes by the ansible playbooks was still the old one.

Comment 12 Joseph Callen 2019-04-15 13:05:56 UTC

Build: openshift-ansible-3.11.106-1

Comment 13 cg 2019-04-16 22:36:20 UTC

@Ryan, thanks for those steps, appreciated. That helped us get back on track.

@Jose - So did we; the same was needed on the masters too.

Also, there are a number of secrets that need to be regenerated: all the "default-tokenxxxx" secrets in each project, and the majority of secrets of type kubernetes.io/tls, especially in the openshift-monitoring project.
Then bounce the impacted pods so they pick up the new secret that is regenerated by the controller.
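
To get an overview of which secrets might be affected, the output of `oc get secrets --all-namespaces --no-headers` can be filtered by type. This is a read-only sketch with a hypothetical function name; it assumes the columns are NAMESPACE, NAME, TYPE, DATA, AGE and deliberately deletes nothing, since which secrets are safe to remove depends on the cluster.

```shell
#!/usr/bin/env bash
# secrets_of_type TYPE: read `oc get secrets --all-namespaces --no-headers`
# output on stdin and print "namespace/name" for every secret of that type.
secrets_of_type() {
  awk -v wanted="$1" '$3 == wanted { print $1 "/" $2 }'
}
```

For example, `oc get secrets --all-namespaces --no-headers | secrets_of_type kubernetes.io/tls` lists the TLS secrets to review, and `kubernetes.io/service-account-token` lists the token secrets.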

Comment 15 Gaoyun Pei 2019-06-13 07:43:17 UTC

Verified this bug with openshift-ansible-3.11.117-1.git.0.add13ff.el7.noarch.rpm


1. Run playbooks/openshift-master/redeploy-openshift-ca.yml to redeploy OpenShift CA cert
2. With openshift_redeploy_openshift_ca=true added in inventory file, run playbooks/redeploy-certificates.yml to redeploy the OpenShift certs

TASK [Remove generated certificates] *******************************************
changed: [ec2-52-90-122-145.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-3-81-234-184.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-3-90-244-172.compute-1.amazonaws.com] => (item=certificates) => {"changed": true, "item": "certificates", "path": "/etc/origin/node/certificates", "state": "absent"}
changed: [ec2-52-90-122-145.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}
changed: [ec2-3-81-234-184.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}
changed: [ec2-3-90-244-172.compute-1.amazonaws.com] => (item=bootstrap.kubeconfig) => {"changed": true, "item": "bootstrap.kubeconfig", "path": "/etc/origin/node/bootstrap.kubeconfig", "state": "absent"}

After redeployment, the node certs and node bootstrap.kubeconfig were recreated.
The node certs were issued by the new CA, and bootstrap.kubeconfig was updated with the new CA.

All nodes are in Ready status, all pods are running, and there are no pending CSRs. Moving this bug to VERIFIED.
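
The node check can be scripted in the same spirit. The sketch below (hypothetical function name) assumes that the second column of `oc get nodes --no-headers` is STATUS and treats values such as `Ready,SchedulingDisabled` as ready.

```shell
#!/usr/bin/env bash
# not_ready_nodes: read `oc get nodes --no-headers` output on stdin and print
# the names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready(,|$)/ { print $1 }'
}
```

An empty result from `oc get nodes --no-headers | not_ready_nodes` corresponds to the "all nodes are in Ready status" observation above.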

Comment 17 errata-xmlrpc 2019-06-26 09:07:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605

Comment 18 Ryan Howe 2019-08-02 15:57:44 UTC

*** Bug 1734552 has been marked as a duplicate of this bug. ***

Comment 19 Omar BENHAMID 2020-11-17 17:30:44 UTC

I ran into this issue and managed to get my cluster into a Ready state by applying the commits of these two PRs before running the redeploy certificates playbook:

- https://github.com/openshift/openshift-ansible/pull/12235
- https://github.com/openshift/openshift-ansible/pull/12235/commits/6864ffe08de7c1d6a4caeeefe6182fa4d67817a0
