Bug 1586010 - Redeploy cert playbook fails at TASK [Wait for node to be ready]
Summary: Redeploy cert playbook fails at TASK [Wait for node to be ready]
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.z
Assignee: Scott Dodson
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-06-05 10:06 UTC by Gaoyun Pei
Modified: 2018-11-19 20:54 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-19 20:54:09 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1585978 0 high CLOSED Redeploy openshift ca playbook failed 2021-02-22 00:41:40 UTC

Internal Links: 1585978

Description Gaoyun Pei 2018-06-05 10:06:35 UTC
Description of problem:
Running the redeploy-certificates.yml playbook against an OCP 3.10 cluster may fail as below:

PLAY [Restart nodes] *******************************************************************************
...

TASK [Restart docker] *******************************************************************************************************************************************************
changed: [ec2-52-90-247-129.compute-1.amazonaws.com] => {"attempts": 1, "changed": true, "failed": false, "name": "docker", "state": "started", ...


TASK [Wait for master API to come back online] ******************************************************************************************************************************
ok: [ec2-52-90-247-129.compute-1.amazonaws.com] => {"changed": false, "elapsed": 25, "failed": false, "path": null, "port": 8443, "search_regex": null, "state": "started"}

TASK [restart node] *********************************************************************************************************************************************************
changed: [ec2-52-90-247-129.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "atomic-openshift-node", "state": "started", "status": ...

TASK [Wait for node to be ready] ********************************************************************************************************************************************
fatal: [ec2-52-90-247-129.compute-1.amazonaws.com]: FAILED! => {"failed": true, "msg": "The conditional check 'node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True' failed. The error was: error while evaluating conditional (node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True): 'dict object' has no attribute 'results'"}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.retry



Sometimes the master-api service is still not available when "TASK [Wait for node to be ready]" runs, because restarting docker also triggers a restart of the master-api/controllers services. So we may need a "verify API server" task before restarting the node.
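The suggestion above could be sketched roughly as below. This is a hypothetical illustration, not the actual openshift-ansible code: the host/group variables (`openshift.common.hostname`, `openshift.node.nodename`, `groups.oo_first_master`) and the use of the plain `command` module instead of whatever module the playbook actually registers `node_output` from are assumptions, and port 8443 is taken from the log above.

```yaml
# Hypothetical task: wait for the master API port before restarting the node,
# since the docker restart also restarts the master-api service.
- name: Verify API server is back online
  wait_for:
    host: "{{ openshift.common.hostname }}"   # assumed variable
    port: 8443                                # master API port seen in the log
    delay: 5
    timeout: 300

# The "Wait for node to be ready" check could also guard against the
# 'dict object has no attribute results' error by retrying until the
# oc call itself succeeds before inspecting its output:
- name: Wait for node to be ready
  command: >
    /usr/bin/oc get node {{ openshift.node.nodename }} -o json -n default
  register: node_output
  delegate_to: "{{ groups.oo_first_master.0 }}"  # assumed group name
  until:
    - node_output.rc == 0
    - (node_output.stdout | from_json).status.conditions
      | selectattr('type', 'match', '^Ready$')
      | map(attribute='status') | join | bool
  retries: 30
  delay: 10
```

Because the `until` list first checks `node_output.rc == 0`, a transient API outage just causes another retry instead of the conditional-evaluation error shown in the traceback.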


Version-Release number of the following components:
openshift-ansible-3.10.0-0.58.0.git.0.d8f6377.el7.noarch


How reproducible:
50% (3 failures in 6 attempts)

Steps to Reproduce:
1.Run openshift cert redeploy playbook
ansible-playbook -i host/310 -vvv /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml


Actual results:
The playbook fails as in the Description. After logging into the master and running "/usr/bin/oc get node ip-172-18-11-10.ec2.internal -o json -n default" manually, the command works.


Expected results:


Additional info:
The Ansible inventory file and the full log with "-vvv" can be found in the attachments.

Comment 3 Scott Dodson 2018-06-05 15:13:50 UTC
The cert re-deploy playbooks are likely broken pretty badly. We'll try to address these in a 0-day.

Comment 5 Russell Teague 2018-11-19 20:54:09 UTC
There appear to be no active cases related to this bug. As such we're closing this bug in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or a new active case is attached.

