Bug 1586010

Summary: Redeploy cert playbook fails at TASK [Wait for node to be ready]
Product: OpenShift Container Platform
Component: Installer
Version: 3.10.0
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED DEFERRED
Reporter: Gaoyun Pei <gpei>
Assignee: Scott Dodson <sdodson>
QA Contact: Gaoyun Pei <gpei>
CC: aos-bugs, jokerman, mmccomas, rteague
Type: Bug
Last Closed: 2018-11-19 20:54:09 UTC

Description Gaoyun Pei 2018-06-05 10:06:35 UTC
Description of problem:
Running the redeploy-certificates.yml playbook against an OCP 3.10 cluster may fail as below:

PLAY [Restart nodes] *******************************************************************************
...

TASK [Restart docker] *******************************************************************************************************************************************************
changed: [ec2-52-90-247-129.compute-1.amazonaws.com] => {"attempts": 1, "changed": true, "failed": false, "name": "docker", "state": "started", ...


TASK [Wait for master API to come back online] ******************************************************************************************************************************
ok: [ec2-52-90-247-129.compute-1.amazonaws.com] => {"changed": false, "elapsed": 25, "failed": false, "path": null, "port": 8443, "search_regex": null, "state": "started"}

TASK [restart node] *********************************************************************************************************************************************************
changed: [ec2-52-90-247-129.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "atomic-openshift-node", "state": "started", "status": ...

TASK [Wait for node to be ready] ********************************************************************************************************************************************
fatal: [ec2-52-90-247-129.compute-1.amazonaws.com]: FAILED! => {"failed": true, "msg": "The conditional check 'node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True' failed. The error was: error while evaluating conditional (node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True): 'dict object' has no attribute 'results'"}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.retry
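
The failure mode here is that the conditional itself errors rather than evaluating false: when the API is unreachable, the registered node_output is a module-failure dict with no "results" key, so dereferencing node_output.results raises instead of letting the until loop retry. Below is a minimal sketch of a more defensive conditional, with the task shape reconstructed from the error message (the module name, delegation target, and retry values are assumptions, not the playbook's verified source):

- name: Wait for node to be ready
  # Sketch only: oc_obj parameters and retry budget are assumptions.
  oc_obj:
    state: list
    kind: node
    name: "{{ openshift.common.hostname | lower }}"
  register: node_output
  delegate_to: "{{ groups.oo_first_master.0 }}"
  # Guard on 'results' existing so a transient API outage makes the
  # loop retry instead of failing the conditional evaluation outright.
  until: >-
    'results' in node_output and
    node_output.results.returncode == 0 and
    node_output.results.results[0].status.conditions
    | selectattr('type', 'match', '^Ready$')
    | map(attribute='status') | join | bool == True
  retries: 36
  delay: 5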



Sometimes the master-api service is still not available when "TASK [Wait for node to be ready]" runs: restarting docker also triggers a restart of the master-api/controllers services, so we may need a "verify API server" task before restarting the node (see the sketch below).
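
A minimal sketch of such a "verify API server" task, polling a health endpoint rather than only checking that the port is open; the health URL, CA bundle path, and retry counts below are illustrative assumptions, not values from the playbook:

- name: Verify API server
  # Sketch only: health URL, CA path, and retry budget are assumptions.
  command: >
    curl --silent --max-time 2
    --cacert /etc/origin/master/ca-bundle.crt
    {{ openshift.master.api_url }}/healthz/ready
  register: api_output
  until: api_output.stdout == 'ok'
  retries: 120
  delay: 1
  changed_when: false

Polling /healthz/ready distinguishes "the port accepts connections" from "the API is actually serving requests", which is exactly the gap a port-based wait task leaves open.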


Version-Release number of the following components:
openshift-ansible-3.10.0-0.58.0.git.0.d8f6377.el7.noarch


How reproducible:
50% (3 failures in 6 attempts)

Steps to Reproduce:
1. Run the OpenShift cert redeploy playbook:
ansible-playbook -i host/310 -vvv /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml


Actual results:
The playbook fails as in the Description; afterwards, logging into the master and running "/usr/bin/oc get node ip-172-18-11-10.ec2.internal -o json -n default" by hand works.


Expected results:


Additional info:
The Ansible inventory file and the full "-vvv" log can be found in the attachments.

Comment 3 Scott Dodson 2018-06-05 15:13:50 UTC
The cert re-deploy playbooks are likely broken pretty badly. We'll try to address these in a 0-day.

Comment 5 Russell Teague 2018-11-19 20:54:09 UTC
There appear to be no active cases related to this bug. As such, we're closing this bug in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or a new active case is attached.