Bug 1645323

Summary: Director deployed OCP 3.11: scaling out with an additional master node fails during TASK [openshift_control_plane : Wait for all control plane pods to become ready]
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Martin André <m.andre>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 14.0 (Rocky)
CC: athomas, dbecker, m.andre, mburns, morazi, sclewis
Target Milestone: rc
Keywords: Triaged
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-9.0.1-0.20181013060891.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-11 11:54:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description        Flags
openshift.tar.gz   none

Description Marius Cornea 2018-11-01 23:43:33 UTC
Created attachment 1500279
openshift.tar.gz

Description of problem:
Director deployed OCP 3.11: scaling out with an additional master node fails during TASK [openshift_control_plane : Wait for all control plane pods to become ready]:

TASK [openshift_control_plane : Wait for all control plane pods to become ready] ***
FAILED - RETRYING: Wait for all control plane pods to become ready (2 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (1 retries left).
failed: [openshift-master-3] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "results": {"cmd": "/bin/oc get pod master-etcd-openshift-master-3 -o json -n kube-system", "results": [{}], "returncode": 0, "stderr": "Error from server (NotFound): pods \"master-etcd-openshift-master-3\" not found\n", "stdout": ""}, "state": "list"}
ok: [openshift-master-3] => (item=api)
ok: [openshift-master-3] => (item=controllers)

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
localhost                  : ok=36   changed=0    unreachable=0    failed=0   
openshift-infra-0          : ok=26   changed=5    unreachable=0    failed=0   
openshift-infra-1          : ok=26   changed=5    unreachable=0    failed=0   
openshift-master-0         : ok=52   changed=7    unreachable=0    failed=0   
openshift-master-1         : ok=52   changed=7    unreachable=0    failed=0   
openshift-master-2         : ok=92   changed=7    unreachable=0    failed=0   
openshift-master-3         : ok=321  changed=126  unreachable=0    failed=1   
openshift-worker-0         : ok=26   changed=5    unreachable=0    failed=0   
openshift-worker-1         : ok=26   changed=5    unreachable=0    failed=0   


INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:01:57)
Node Bootstrap Preparation  : Complete (0:03:45)
Master Install              : In Progress (0:08:52)
	This phase can be restarted by running: playbooks/openshift-master/config.yml


Failure summary:


  1. Hosts:    openshift-master-3
     Play:     Configure masters
     Task:     Wait for all control plane pods to become ready
     Message:  All items completed
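
For triage, the failed item in the task output above can be unpacked to show what the task was polling. The following is a minimal Python sketch, not part of openshift-ansible: the names FAILED_ITEM and summarize are hypothetical, and the payload is copied verbatim from the failed etcd item. It shows that the payload reports returncode 0 and an empty result list even though stderr says the pod was not found.

```python
import json

# Payload copied from the failed "etcd" item in the task output.
# A raw string keeps the JSON escapes (\" and \n) intact for json.loads.
FAILED_ITEM = r'''{"attempts": 60, "changed": false, "item": "etcd", "results": {"cmd": "/bin/oc get pod master-etcd-openshift-master-3 -o json -n kube-system", "results": [{}], "returncode": 0, "stderr": "Error from server (NotFound): pods \"master-etcd-openshift-master-3\" not found\n", "stdout": ""}, "state": "list"}'''


def summarize(payload: str) -> dict:
    """Extract the fields that matter for triage from one failed item."""
    item = json.loads(payload)
    res = item["results"]
    return {
        "item": item["item"],
        "attempts": item["attempts"],
        "cmd": res["cmd"],
        "returncode": res["returncode"],
        "stderr": res["stderr"].strip(),
        # An empty dict is falsy, so [{}] means no pod object came back.
        "pod_found": any(res["results"]),
    }


if __name__ == "__main__":
    for key, value in summarize(FAILED_ITEM).items():
        print(f"{key}: {value}")
```

The command shown in the cmd field can also be run by hand on a master node to confirm whether the etcd static pod for openshift-master-3 exists.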


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060867.ffbe879.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an environment with 3 master, 2 infra, and 2 worker nodes
2. Add a fourth master node and re-run the overcloud deploy command

Actual results:
Deployment fails.

Expected results:
No failures.

Additional info:
Attaching /var/lib/mistral.

Comment 1 Martin André 2018-11-07 08:15:27 UTC
I've tried to reproduce this issue twice, and both times it failed earlier for me with a different error:

TASK [etcd : Ensure CA certificate exists on etcd_ca_host] *********************
ok: [openshift-openshiftmaster-1 -> 192.168.24.24]

TASK [etcd : fail] *************************************************************
fatal: [openshift-openshiftmaster-1]: FAILED! => {"changed": false, "msg": "CA certificate /etc/etcd/ca/ca.crt doesn't exist on CA host openshift-openshiftmaster-1. Apply 'etcd_ca' action from `etcd` role to openshift-openshiftmaster-1.\n"}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
localhost                  : ok=39   changed=0    unreachable=0    failed=0   
openshift-openshiftinfra-0 : ok=27   changed=5    unreachable=0    failed=0   
openshift-openshiftinfra-1 : ok=27   changed=5    unreachable=0    failed=0   
openshift-openshiftinfra-2 : ok=27   changed=5    unreachable=0    failed=0   
openshift-openshiftmaster-0 : ok=53   changed=7    unreachable=0    failed=0   
openshift-openshiftmaster-1 : ok=242  changed=71   unreachable=0    failed=1   
openshift-openshiftworker-0 : ok=27   changed=5    unreachable=0    failed=0   
openshift-openshiftworker-1 : ok=27   changed=5    unreachable=0    failed=0   
openshift-openshiftworker-2 : ok=27   changed=5    unreachable=0    failed=0   


INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:01:14)
Node Bootstrap Preparation  : Complete (0:04:51)


Failure summary:


  1. Hosts:    openshift-openshiftmaster-1
     Play:     Create etcd client certificates for master hosts
     Task:     etcd : fail
     Message:  CA certificate /etc/etcd/ca/ca.crt doesn't exist on CA host openshift-openshiftmaster-1. Apply 'etcd_ca' action from `etcd` role to openshift-openshiftmaster-1.
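
When comparing runs like the two in this report, the failing host can be picked out of the PLAY RECAP mechanically. The following is a minimal Python sketch under stated assumptions: the names RECAP, RECAP_LINE, and failed_hosts are hypothetical, the regular expression only covers the recap columns that appear in this report, and RECAP holds three lines copied from the recap above.

```python
import re

# Three lines copied from the PLAY RECAP in this comment.
RECAP = """\
localhost                  : ok=39   changed=0    unreachable=0    failed=0
openshift-openshiftmaster-0 : ok=53   changed=7    unreachable=0    failed=0
openshift-openshiftmaster-1 : ok=242  changed=71   unreachable=0    failed=1
"""

# host : ok=N changed=N unreachable=N failed=N
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)"
    r"\s+unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def failed_hosts(recap: str) -> list:
    """Return hosts with a non-zero failed or unreachable count."""
    hosts = []
    for line in recap.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match is None:
            continue  # skip blank lines and anything that is not a recap row
        if int(match["failed"]) or int(match["unreachable"]):
            hosts.append(match["host"])
    return hosts


if __name__ == "__main__":
    print(failed_hosts(RECAP))
```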

Comment 3 Martin André 2018-11-21 13:01:49 UTC
The upstream patch at https://review.openstack.org/616584 should fix the issue.

Comment 10 Martin André 2019-01-10 10:19:37 UTC
No doc text required.

Comment 11 errata-xmlrpc 2019-01-11 11:54:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045