Bug 1837123 - redeploy-certificates.yaml did not update certificates properly

Summary: redeploy-certificates.yaml did not update certificates properly

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Russell Teague
QA Contact:	Gaoyun Pei
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-05-18 21:26 UTC by Brandon Smitley
Modified:	2023-12-15 17:56 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Adds a check of master-config.yaml to determine whether servingInfo.clientCA has been reverted to ca.crt. If it has not, the play fails with a message indicating that openshift_redeploy_openshift_ca=true must be set in the inventory. This check prevents an inadvertent certificate redeploy when the OpenShift CA has been updated but not yet rolled out.
Clone Of:
Environment:
Last Closed:	2020-10-22 11:02:22 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openshift-ansible pull 12238	0	None	closed	Bug 1837123: Detect an incomplete OpenShift CA redeployment	2021-02-19 23:07:31 UTC
Red Hat Product Errata	RHBA-2020:4170	0	None	None	None	2020-10-22 11:02:46 UTC

Description Brandon Smitley 2020-05-18 21:26:59 UTC

Description of problem:
PROBLEM: 
Ran 'ansible-playbook -v /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yaml'; afterwards the webconsole, console, and logging components all report "x509: certificate signed by unknown authority".



Version-Release number of selected component (if applicable):
OpenShift 3.11

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
==========

>1) The customer first ran the playbook at 10:00 CDT, and it failed after about 40 minutes. At that point all 3 masters were in a 'NotReady' state. The customer reran the same playbook (2nd run) hoping it would restore them to service, but they did not return to normal. The customer was able to get them back online by bootstrapping them per the same document the customer cited earlier. The customer then restarted the playbook (3rd run), which continued to run.
===========

>2) Pods hosted on the 3 masters (logging-fluentd, webconsole, and console) would not come online, with 'x509: certificate signed by unknown authority' errors in the events. This started with the first run of the playbook and continued through the 3rd unsuccessful run. The customer resolved this by restarting the ovs-* pods in the openshift-sdn namespace on the masters, after which the other pods started to come online.
============

>3) The 3rd run of the playbook almost completed, but issues arose towards the end when the openshift-web-console failed to start up. The customer noticed that the router pods were also stuck rolling out with their new certificates: 2 new pods had started, but none of the others were proceeding. The customer cancelled the rollout of the new router as users started complaining of issues with their applications. At this time, all nodes in the cluster started reporting 'NotReady'.

fatal: [1002apfrp00021.optumfe.com]: FAILED! => {"attempts": 60, "changed": false, "module_results": {"cmd": "/usr/bin/oc get deployment webconsole -o json -n openshift-web-console", "results":

  "observedGeneration": 3, "replicas": 3, "unavailableReplicas": 3, "updatedReplicas": 3}}]


Expected results:
For the playbook to run successfully and the certificates to be updated. 

Additional info:

Comment 3 Standa Laznicka 2020-05-19 12:13:42 UTC

Moving to the ansible team, I do not know what the playbook actually does.

Comment 10 Russell Teague 2020-07-10 18:44:25 UTC

To be reviewed as part of https://issues.redhat.com/browse/CORS-1470

Comment 11 Russell Teague 2020-07-20 18:38:52 UTC

Jira issue https://issues.redhat.com/browse/CORS-1470 was not scheduled for the current sprint.

Comment 15 Gaoyun Pei 2020-09-29 14:14:28 UTC

Verified this bug with openshift-ansible-3.11.299-1.git.0.2dfaf92.el7.noarch.rpm.

1. Redeploy openshift CA
ansible-playbook openshift-ansible/playbooks/openshift-master/redeploy-openshift-ca.yml -v

2. Redeploy openshift certificates
ansible-playbook openshift-ansible/playbooks/redeploy-certificates.yml -v

The redeploy-certificates.yml play fails at the new check:


09-29 22:11:48  TASK [Check servingInfo.clientCA = ca.crt in master config] ********************
09-29 22:11:48  fatal: [ec2-52-90-69-73.compute-1.amazonaws.com]: FAILED! => {"changed": false, "msg": "Detected an incomplete OpenShift CA redeployment.  Please set openshift_redeploy_openshift_ca=true in the inventory and re-run redeploy-certifcates.yml\n"}
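
For anyone who wants to run a similar check by hand before starting redeploy-certificates.yml, the comparison named in the task above (servingInfo.clientCA = ca.crt) can be approximated with a small shell sketch. This is not the code from the openshift-ansible fix; the function names are hypothetical, and it assumes the master config uses plain space-indented YAML with an unquoted clientCA value under a top-level servingInfo key.

```shell
#!/bin/sh
# Hypothetical helpers; a rough approximation of the playbook check,
# not the actual openshift-ansible implementation.

# Print the clientCA value found under the top-level "servingInfo:" key
# of the YAML file given as $1. Prints nothing if the key is absent.
serving_client_ca() {
    awk '
        # Any top-level key starts a new block; remember whether it is servingInfo.
        /^[^ #]/ { in_block = ($0 ~ /^servingInfo:/) }
        # Inside servingInfo, take the first indented clientCA entry.
        in_block && /^ +clientCA:/ {
            sub(/^ +clientCA: */, "")
            sub(/ +$/, "")
            print
            exit
        }
    ' "$1"
}

# Return success only when servingInfo.clientCA is set to ca.crt.
ca_redeploy_complete() {
    [ "$(serving_client_ca "$1")" = "ca.crt" ]
}
```

A possible use, assuming the master config lives at /etc/origin/master/master-config.yaml: run ca_redeploy_complete against that path on each master, and if it returns non-zero, set openshift_redeploy_openshift_ca=true in the inventory as the error message instructs.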

Comment 18 errata-xmlrpc 2020-10-22 11:02:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.306 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4170
