Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1628233

Summary: openshift-ansible release-3.10 deployment fails on "Wait for all control plane pods to become ready"
Product: OpenShift Container Platform
Reporter: Sergii Marynenko <marynenko>
Component: Installer
Assignee: Scott Dodson <sdodson>
Status: CLOSED NOTABUG
QA Contact: Johnny Liu <jialiu>
Severity: medium
Docs Contact:
Priority: low
Version: 3.10.0
CC: aos-bugs, jokerman, mark.vinkx, marynenko, mmccomas, shlao, vrutkovs
Target Milestone: ---
Target Release: 3.10.z
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-04 16:35:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
ansible log file (flags: none)
inventory file (flags: none)

Description Sergii Marynenko 2018-09-12 13:48:36 UTC
Created attachment 1482695 [details]
ansible log file

Description of problem:

/tmp/openshift is created with:
git clone --depth=1 -b release-3.10 https://github.com/openshift/openshift-ansible.git /tmp/openshift

Executing
ansible-playbook -i inventory/openshift-inventory.ini /tmp/openshift/playbooks/deploy_cluster.yml -vvv

Failed with:
"stderr": "Error from server (NotFound): pods \"master-controllers-testosmaster1.xxxxx.xxxxxxx.com\" not found\n",
 
(Real domain is masked with xxxxx.xxxxxxx.com)

Version-Release number of the following components:
rpm -q openshift-ansible
package openshift-ansible is not installed

rpm -q ansible
ansible-2.6.3-1.el7.noarch

ansible --version
ansible 2.6.3
  config file = /home/MUENCHEN/smarynenko/work/test/openshift/ansible.cfg
  configured module search path = [u'/home/MUENCHEN/smarynenko/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

Linux testosmaster1 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"


How reproducible: 

Steps to Reproduce:
1. Create virtual machines on vSphere - 3 masters, 3 nodes, 2 infras
2. In openshift-inventory.ini, change "xxxxx.xxxxxxx.com" to your desired domain and run the OpenShift installer (deployment type: origin, branch: release-3.10):
ansible-playbook -i inventory/openshift-inventory.ini /tmp/openshift/playbooks/deploy_cluster.yml 
3. Observe error message in the ansible log.

Actual results:
  1. Hosts:    testosmaster1.xxxxx.xxxxxxx.com, testosmaster2.xxxxx.xxxxxxx.com, testosmaster3.xxxxx.xxxxxxx.com
     Play:     Configure masters
     Task:     Wait for all control plane pods to become ready
     Message:  All items completed

Expected results:
Installation finished successfully

Comment 1 Sergii Marynenko 2018-09-12 13:55:17 UTC
The workaround is to add "ignore_errors: true" to the task "Wait for all control plane pods to become ready"
in
/tmp/openshift/roles/openshift_control_plane/tasks/main.yml
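
As a sketch only (the surrounding task parameters shown are placeholders, not copied from main.yml), the workaround amounts to appending `ignore_errors: true` to that task:

```yaml
# Hypothetical sketch; the module parameters are illustrative,
# not copied from roles/openshift_control_plane/tasks/main.yml.
- name: Wait for all control plane pods to become ready
  # ... existing module parameters unchanged ...
  ignore_errors: true  # workaround only; the play continues past the failed wait
```

This only suppresses the failed task rather than fixing the underlying problem.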

Comment 2 Sergii Marynenko 2018-09-12 13:58:50 UTC
NotFound pods like "master-controllers-testosmaster1.xxxxx.xxxxxxx.com"
are actually not found; the existing pods have no domain part in their names:
oc get pods --all-namespaces
NAMESPACE     NAME                               READY     STATUS    RESTARTS   AGE
kube-system   master-api-testosmaster1           1/1       Running   0          4h
kube-system   master-api-testosmaster2           1/1       Running   0          4h
kube-system   master-api-testosmaster3           1/1       Running   0          4h
kube-system   master-controllers-testosmaster1   1/1       Running   0          4h
kube-system   master-controllers-testosmaster2   1/1       Running   0          4h
kube-system   master-controllers-testosmaster3   1/1       Running   0          4h
kube-system   master-etcd-testosmaster1          1/1       Running   0          4h
kube-system   master-etcd-testosmaster2          1/1       Running   0          4h
kube-system   master-etcd-testosmaster3          1/1       Running   0          4h

Comment 3 Vadim Rutkovsky 2018-09-12 14:27:01 UTC
(In reply to smarynenko from comment #2)
> NotFound pods like "master-controllers-testosmaster1.xxxxx.xxxxxxx.com"
> are actually not found; the existing pods have no domain part in their names:
> oc get pods --all-namespaces

openshift-ansible uses openshift.node.nodename to predict pod names, which is generated from `hostname -f` output on the nodes. The kubelet also reads this value and would name pods accordingly.

It seems "raw_hostname" is set to "testosmaster1", but other nodenames are set to FQDN.

openshift-ansible would use whatever FQDN is specified (since there's no cloud provider here).


Could you verify that the FQDN on the host AND on your DNS server resolve to the same value (be it short or long, but pick one)?

Scott, it seems `raw_hostname` should be used in any case, as the kubelet picks it most of the time
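
The mismatch described above can be illustrated with a small sketch (the hostnames are whatever the local machine reports; the `master-controllers-` prefix matches the static pod naming seen in this report):

```shell
# Illustration of the mismatch: openshift-ansible predicts static pod
# names from openshift.node.nodename (derived from `hostname -f`), while
# in this report the kubelet registered the pods under the short name.
predicted="master-controllers-$(hostname -f)"
actual="master-controllers-$(hostname -s)"
echo "predicted: $predicted"
echo "actual:    $actual"
```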

Comment 4 Sergii Marynenko 2018-09-12 15:44:41 UTC
>Could you verify that the FQDN on the host AND on your DNS server resolve to the same
>value (be it short or long, but pick one)?

"hostname -f" on all of the nodes returns an FQDN
(an FQDN consists of a short host name and the DNS domain name),
like:
testosmaster1.xxxxx.xxxxxxx.com

btw "man hostname" says:
-f, --fqdn, --long
Display the FQDN (Fully Qualified Domain Name). A FQDN consists of a short host name and the DNS domain name. Unless you are using bind or NIS for host lookups you can change the FQDN and the DNS domain name (which is part of the FQDN) in the /etc/hosts file. See the warnings in section THE FQDN above, and avoid using this option; use hostname --all-fqdns instead.

So:
[root@testosmaster1 ~]# hostname --all-fqdns
testosmaster1.xxxxx.xxxxxxx.com testosmaster1

DNS server has A records in the direct zone xxxxx.xxxxxxx.com for all nodes:
testosmaster1.xxxxx.xxxxxxx.com. 3600 IN A 172.16.25.205
testosmaster2.xxxxx.xxxxxxx.com. 3585 IN A 172.16.25.206
testosmaster3.xxxxx.xxxxxxx.com. 3600 IN A 172.16.25.207

So the DNS answer differs only by the trailing "."

The DNS server isn't configured to serve reverse lookups for those names, as the documentation does not require it.
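
A quick forward-lookup consistency check along the lines Vadim suggests might look like this (a sketch: it assumes `dig` is installed; it compares the A record for `hostname -f` against the host's own primary IP, and strips the trailing "." that DNS answers carry):

```shell
#!/bin/sh
# Check that forward DNS for `hostname -f` matches the host's own IP.
normalize() {
  # strip the trailing "." that DNS answers end with
  printf '%s' "$1" | sed 's/\.$//'
}

host_fqdn=$(hostname -f)
dns_ip=$(normalize "$(dig +short "$host_fqdn" 2>/dev/null | head -n1)")
local_ip=$(hostname -i 2>/dev/null | awk '{print $1}')

if [ -n "$dns_ip" ] && [ "$dns_ip" = "$local_ip" ]; then
  echo "OK: DNS A record for $host_fqdn matches local IP $local_ip"
else
  echo "MISMATCH: DNS returned '$dns_ip', host reports '$local_ip'"
fi
```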

Comment 5 Sergii Marynenko 2018-09-12 15:53:40 UTC
/etc/hosts on each node contains only two rows;
for instance, on testosmaster1:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
172.16.25.205   testosmaster1.xxxxx.xxxxxxx.com testosmaster1

Each node has its own 'IP FQDN hostname' as the second row.

Comment 6 Sergii Marynenko 2018-09-12 16:00:45 UTC
Created attachment 1482757 [details]
inventory file

Comment 7 sheng.lao 2018-09-27 07:48:29 UTC
This problem may be caused by /etc/resolv.conf. If it contains a line like:
search xxx.yyy.zzz.com

then the command `hostname -f` will output 'testosmaster2.xxx.yyy.zzz.com'.

I am checking which process modifies the file '/etc/resolv.conf'.
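
How a resolver `search` line turns a short name into an FQDN can be sketched like this (the helper name is hypothetical; it just reads an /etc/resolv.conf-style file):

```shell
# Print the candidate FQDNs that a resolver "search" line would produce
# for a given short hostname, reading an /etc/resolv.conf-style file.
expand_with_search() {
  short="$1"; conf="$2"
  awk -v h="$short" '$1 == "search" { for (i = 2; i <= NF; i++) print h "." $i }' "$conf"
}

# e.g. expand_with_search testosmaster2 /etc/resolv.conf
```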

Comment 8 Sergii Marynenko 2018-10-04 16:35:16 UTC
Solved by removing the line with the FQDN from the /etc/hosts file.
After a Terraform-managed VM is created in vSphere, each VM has a line with its external IP and FQDN in /etc/hosts. Removing that line eliminates the issue.
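
A sketch of that fix as a command (the domain pattern is the masked example from this report; adapt it to the real domain, and note the loopback line is deliberately kept intact):

```shell
# Remove the non-loopback line that maps the external IP to the FQDN
# from an /etc/hosts-style file given as $1; a .bak backup is kept.
remove_fqdn_line() {
  sed -i.bak '/xxxxx\.xxxxxxx\.com/{/^127\./!d}' "$1"
}

# e.g. remove_fqdn_line /etc/hosts
```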